Sample records for high performance processor

  1. A high performance linear equation solver on the VPP500 parallel supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi

    1994-12-31

    This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of the VPP500: (1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LINPACK Highly Parallel Computing benchmark.
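
    As a concrete illustration of the blocked LU approach this solver family builds on, the NumPy sketch below factors a matrix in nb-wide panels; the block size, the pure-Python panel kernels, and the omission of pivoting are simplifications for clarity, not details of the VPP500 implementation.

    ```python
    # Right-looking blocked LU factorization (no pivoting) -- a minimal
    # sketch of the LINPACK-style kernel, not the Fujitsu library code.
    import numpy as np

    def blocked_lu(A, nb=64):
        """Factor A = L*U in place, processing nb-wide column panels."""
        n = A.shape[0]
        for k in range(0, n, nb):
            e = min(k + nb, n)
            # Unblocked LU of the tall panel A[k:, k:e]
            for j in range(k, e):
                A[j+1:, j] /= A[j, j]
                A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
            # Forward-substitute to form the U panel right of the block
            for j in range(k, e):
                A[j, e:] -= A[j, k:j] @ A[k:j, e:]
            # Rank-nb update of the trailing submatrix: the GEMM-rich step
            # that parallel solvers distribute and overlap with transfers
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
        return A

    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 256)) + 256 * np.eye(256)  # safely factorable
    F = blocked_lu(A.copy())
    L = np.tril(F, -1) + np.eye(256)
    U = np.triu(F)
    assert np.allclose(L @ U, A)
    ```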

  2. High-performance computing — an overview

    NASA Astrophysics Data System (ADS)

    Marksteiner, Peter

    1996-08-01

    An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, and various parallel computers such as symmetric multiprocessors, workstation clusters, and massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.

  3. Managing Power Heterogeneity

    NASA Astrophysics Data System (ADS)

    Pruhs, Kirk

    A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core were required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.

  4. Applications considerations in the system design of highly concurrent multiprocessors

    NASA Technical Reports Server (NTRS)

    Lundstrom, Stephen F.

    1987-01-01

    A flow model processor approach to parallel processing is described, using very-high-performance individual processors, high-speed circuit switched interconnection networks, and a high-speed synchronization capability to minimize the effect of the inherently serial portions of applications on performance. Design studies related to the determination of the number of processors, the memory organization, and the structure of the networks used to interconnect the processor and memory resources are discussed. Simulations indicate that applications centered on the large shared data memory should be able to sustain over 500 million floating point operations per second.

  5. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    NASA Technical Reports Server (NTRS)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  6. Stanford Hardware Development Program

    NASA Technical Reports Server (NTRS)

    Peterson, A.; Linscott, I.; Burr, J.

    1986-01-01

    Architectures for high performance, digital signal processing, particularly for high resolution, wide band spectrum analysis were developed. These developments are intended to provide instrumentation for NASA's Search for Extraterrestrial Intelligence (SETI) program. The real-time signal processing work is both formal and experimental. The efficient organization and optimal scheduling of signal processing algorithms were investigated. The work is complemented by efforts in processor architecture design and implementation. A high resolution, multichannel spectrometer that incorporates special purpose microcoded signal processors is being tested. A general purpose signal processor for the data from the multichannel spectrometer was designed to function as the processing element in a highly concurrent machine. The processor performance required for the spectrometer is in the range of 1000 to 10,000 million instructions per second (MIPS). Multiple node processor configurations, where each node performs at 100 MIPS, are sought. The nodes are microprogrammable and are interconnected through a network with high bandwidth for neighboring nodes, and medium bandwidth for nodes at larger distance. The implementation of both the current multichannel spectrometer and the signal processor as Very Large Scale Integration CMOS chip sets was commenced.

  7. A High Performance VLSI Computer Architecture For Computer Graphics

    NASA Astrophysics Data System (ADS)

    Chin, Chi-Yuan; Lin, Wen-Tai

    1988-10-01

    A VLSI computer architecture, consisting of multiple processors, is presented in this paper to satisfy modern computer graphics demands, e.g. high resolution, realistic animation, real-time display, etc. All processors share a global memory which is partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcast to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigured to satisfy specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e. object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection, can be easily handled. With the current high density interconnection (MI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.

  8. Implementing An Image Understanding System Architecture Using Pipe

    NASA Astrophysics Data System (ADS)

    Luck, Randall L.

    1988-03-01

    This paper will describe PIPE and how it can be used to implement an image understanding system. Image understanding is the process of developing a description of an image in order to make decisions about its contents. The tasks of image understanding are generally split into low level vision and high level vision. Low level vision is performed by PIPE, a high performance parallel processor with an architecture specifically designed for processing video images at up to 60 fields per second. High level vision is performed by one of several types of serial or parallel computers, depending on the application. An additional processor called ISMAP performs the conversion from iconic image space to symbolic feature space. ISMAP plugs into one of PIPE's slots and is memory mapped into the high level processor. Thus it forms the high speed link between the low and high level vision processors. The mechanisms for bottom-up, data driven processing and top-down, model driven processing are discussed.

  9. High Performance Processors for Space Environments: A Subproject of the NASA Exploration Missions Systems Directorate "Radiation Hardened Electronics for Space Environments" Technology Development Program

    NASA Technical Reports Server (NTRS)

    Johnson, M.; Label, K.; McCabe, J.; Powell, W.; Bolotin, G.; Kolawa, E.; Ng, T.; Hyde, D.

    2007-01-01

    Implementation of challenging Exploration Systems Missions Directorate objectives and strategies can be constrained by onboard computing capabilities and power efficiencies. The Radiation Hardened Electronics for Space Environments (RHESE) High Performance Processors for Space Environments project will address this challenge by significantly advancing the sustained throughput and processing efficiency of high-performance radiation-hardened processors, targeting delivery of products by the end of FY12.

  10. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
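
    Since the abstract highlights bit-serial arithmetic units, here is a toy Python model of the idea: a serial full adder that consumes one bit of each operand per simulated clock, least significant bit first. The word width and interface are invented for illustration and are unrelated to SPEC's design.

    ```python
    def bit_serial_add(a, b, width=8):
        """Add two unsigned integers one bit per 'clock', LSB first."""
        carry, out = 0, 0
        for i in range(width):
            abit = (a >> i) & 1
            bbit = (b >> i) & 1
            out |= (abit ^ bbit ^ carry) << i                # sum bit
            carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry out
        return out

    assert bit_serial_add(100, 55) == 155
    ```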

  11. Hot Chips and Hot Interconnects for High End Computing Systems

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    2005-01-01

    I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in the IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in the DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in the NEC SX-6/7; 8. Power 4+ processor, which is used in the Hitachi SR11000; 9. NEC proprietary processor, which is used in the Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on the latest NAS Parallel Benchmark results (MPI, OpenMP/HPF, and hybrid MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing (quantum computing, DNA computing, cellular engineering, and neural networks).

  12. Fault tolerant, radiation hard, high performance digital signal processor

    NASA Technical Reports Server (NTRS)

    Holmann, Edgar; Linscott, Ivan R.; Maurer, Michael J.; Tyler, G. L.; Libby, Vibeke

    1990-01-01

    An architecture has been developed for a high-performance VLSI digital signal processor that is highly reliable, fault-tolerant, and radiation-hard. The signal processor, part of a spacecraft receiver designed to support uplink radio science experiments at the outer planets, organizes the connections between redundant arithmetic resources, register files, and memory through a shuffle exchange communication network. The configuration of the network and the state of the processor resources are all under microprogram control, which both maps the resources according to algorithmic needs and reconfigures the processing should a failure occur. In addition, the microprogram is reloadable through the uplink to accommodate changes in the science objectives throughout the course of the mission. The processor will be implemented with silicon compiler tools, and its design will be verified through silicon compilation simulation at all levels from the resources to full functionality. By blending reconfiguration with redundancy the processor implementation is fault-tolerant and reliable, and possesses the long expected lifetime needed for a spacecraft mission to the outer planets.

  13. WATERLOOP V2/64: A highly parallel machine for numerical computation

    NASA Astrophysics Data System (ADS)

    Ostlund, Neil S.

    1985-07-01

    Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor communication, global synchronization, and contention for shared resources if the performance of a large number (n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attempts to solve the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.

  14. Processors for wavelet analysis and synthesis: NIFS and TI-C80 MVP

    NASA Astrophysics Data System (ADS)

    Brooks, Geoffrey W.

    1996-03-01

    Two processors are considered for image quadrature mirror filtering (QMF). The neuromorphic infrared focal-plane sensor (NIFS) is an existing prototype analog processor offering high speed spatio-temporal Gaussian filtering, which could be used for the QMF low-pass function, and difference of Gaussian filtering, which could be used for the QMF high-pass function. Although not designed specifically for wavelet analysis, the biologically-inspired system accomplishes the most computationally intensive part of QMF processing. The Texas Instruments (TI) TMS320C80 Multimedia Video Processor (MVP) is a 32-bit RISC master processor with four advanced digital signal processors (DSPs) on a single chip. Algorithm partitioning, memory management and other issues are considered for optimal performance. This paper presents these considerations with simulated results leading to processor implementation of high-speed QMF analysis and synthesis.
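
    To make the QMF split concrete, below is a minimal NumPy illustration of one-level analysis and synthesis using the Haar filter pair; the NIFS and MVP processors described above apply the same low-pass/high-pass decomposition with their own filters, so this is only a sketch of the operation.

    ```python
    import numpy as np

    x = np.arange(8, dtype=float)            # toy 1-D signal, even length

    # Analysis: low-pass and high-pass (mirror) filtering plus decimation
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)

    # Synthesis: upsample and recombine; the Haar pair reconstructs exactly
    y = np.empty_like(x)
    y[0::2] = (lo + hi) / np.sqrt(2)
    y[1::2] = (lo - hi) / np.sqrt(2)
    assert np.allclose(y, x)                 # perfect reconstruction
    ```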

  15. Compute Server Performance Results

    NASA Technical Reports Server (NTRS)

    Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)

    1994-01-01

    Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ratio (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPRs of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite.

  16. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day-to-day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that, contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  17. Implementing direct, spatially isolated problems on transputer networks

    NASA Technical Reports Server (NTRS)

    Ellis, Graham K.

    1988-01-01

    Parametric studies were performed on transputer networks of up to 40 processors to determine how to implement and maximize the performance of the solution of problems where no processor-to-processor data transfer is required for the problem solution (spatially isolated). Two types of problems are investigated: a computationally intensive problem whose solution required the transmission of 160 bytes of data through the parallel network, and a communication-intensive example that required the transmission of 3 Mbytes of data through the network. This data consists of solutions being sent back to the host processor and not intermediate results for another processor to work on. Studies were performed on both integer and floating-point transputers. The latter features an on-chip floating-point math unit and offers approximately an order of magnitude performance increase over the integer transputer on real valued computations. The results indicate that a minimum amount of work is required on each node per communication to achieve high network speedups (efficiencies). The floating-point processor requires approximately an order of magnitude more work per communication than the integer processor because of the floating-point unit's increased computing capacity.
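
    The relationship reported above, work per communication versus achievable speedup, can be caricatured with a toy model like the one below; all timing numbers are invented, and communication is assumed not to overlap computation.

    ```python
    def speedup(n, t_work, t_comm):
        """n nodes, each doing t_work of compute per t_comm of communication."""
        return n * t_work / (t_work + t_comm)

    # More work per communication -> efficiency approaches 1 (speedup -> n)
    for t_work in (0.1, 1.0, 10.0):          # arbitrary units
        print(t_work, round(speedup(40, t_work, t_comm=1.0), 1))
    ```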

  18. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

    PubMed

    Cheung, Kit; Schultz, Simon R; Luk, Wayne

    2015-01-01

    NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.

  19. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors

    PubMed Central

    Cheung, Kit; Schultz, Simon R.; Luk, Wayne

    2016-01-01

    NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542
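
    As a stand-alone illustration of one of the neuronal models named in this abstract, the sketch below integrates a single Izhikevich neuron with the standard regular-spiking parameters; it is independent of NeuroFlow and PyNN, and the injected current is a made-up constant.

    ```python
    # Izhikevich (2003) neuron, regular-spiking parameters; toy input.
    a, b, c, d = 0.02, 0.2, -65.0, 8.0
    dt, T = 0.25, 1000.0                     # integration step, duration (ms)
    v, u = -65.0, b * -65.0                  # membrane potential, recovery
    spikes = []
    for step in range(int(T / dt)):
        I = 10.0                             # constant injected current (toy)
        v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:                        # spike: record and reset
            spikes.append(step * dt)
            v, u = c, u + d
    print(f"{len(spikes)} spikes in {T:.0f} ms")
    ```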

  20. Development of small scale cluster computer for numerical analysis

    NASA Astrophysics Data System (ADS)

    Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

    2017-09-01

    In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processors involved is a quad-core processor, giving the cluster eight cores in total. The cluster runs the Ubuntu 14.04 Linux environment with an MPI implementation (MPICH2). Two main tests were conducted on the cluster: a communication test and a performance test. The communication test verified that the computers could pass the required information without any problem, using a simple MPI "Hello World" program written in C. Additionally, a performance test was done to show that the cluster's calculation performance is much better than that of a single-CPU computer. In this performance test, the same code was run four times, using a single node, 2 processors, 4 processors, and 8 processors. The results show that with additional processors, the time required to solve the problem decreases, roughly halving each time the number of processors is doubled. To conclude, we successfully developed a small scale cluster computer using common hardware that is capable of higher computing power than a single-CPU machine, which can be beneficial for research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
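
    A hedged sketch of the two tests described above, written with mpi4py rather than the C MPI program used in the study, might look as follows; the kernel, problem size, and file name are illustrative. Run with, e.g., mpiexec -n 8 python cluster_test.py.

    ```python
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Communication test: every rank checks in ("Hello World"-style).
    print(f"Hello from rank {rank} of {size}")

    # Performance test: time the same workload at different process counts.
    N = 2_000_000
    comm.Barrier()
    t0 = MPI.Wtime()
    local = sum(i * i for i in range(rank, N, size))   # strided work share
    total = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"{size} processes: {MPI.Wtime() - t0:.3f} s, checksum {total}")
    ```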

  1. FPGA Acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.

    PubMed

    Zierke, Stephanie; Bakos, Jason D

    2010-04-12

    Maximum likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such it contains a high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA's on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors. We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10x speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation that is chosen specifically to leverage a deeply pipelined custom architecture. Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption as compared to many-core processors and Graphics Processor Units (GPUs).
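
    The loop structure that makes the PLF attractive for deep pipelining can be sketched in a few lines of NumPy; the array shapes (sites by four nucleotide states) and the toy transition matrices below are illustrative, not MrBayes internals.

    ```python
    import numpy as np

    def plf_node(L_left, L_right, P_left, P_right):
        """Parent conditional likelihoods from two children.

        L_*: (n_sites, 4) child conditional likelihoods
        P_*: (4, 4) branch transition-probability matrices
        """
        # Per-site independent dot products with no branching: exactly the
        # dependency-free loop that an FPGA pipeline can keep full.
        return (L_left @ P_left.T) * (L_right @ P_right.T)

    rng = np.random.default_rng(0)
    P = np.full((4, 4), 0.05) + 0.85 * np.eye(4)   # toy transition matrix
    tips = rng.random((1000, 4))
    print(plf_node(tips, tips, P, P).shape)         # (1000, 4)
    ```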

  2. Federal Plan for High-End Computing. Report of the High-End Computing Revitalization Task Force (HECRTF)

    DTIC Science & Technology

    2004-07-01

    steadily for the past fifteen years, while memory latency and bandwidth have improved much more slowly. For example, Intel processor clock rates have... processor and memory performance) all greatly restrict the ability to achieve high levels of performance for science, engineering, and national...sub-nuclear distances. Guide experiments to identify transition from quantum chromodynamics to quark-gluon plasma. Accelerator Physics Accurate

  3. Optical Associative Processors For Visual Perception

    NASA Astrophysics Data System (ADS)

    Casasent, David; Telfer, Brian

    1988-05-01

    We consider various associative processor modifications required to allow these systems to be used for visual perception, scene analysis, and object recognition. For these applications, decisions on the class of the objects present in the input image are required and thus heteroassociative memories are necessary (rather than the autoassociative memories that have been given most attention). We analyze the performance of both associative processors and note that there is considerable difference between heteroassociative and autoassociative memories. We describe associative processors suitable for realizing functions such as: distortion invariance (using linear discriminant function memory synthesis techniques), noise and image processing performance (using autoassociative memories in cascade with a heteroassociative processor and with a finite number of autoassociative memory iterations employed), shift invariance (achieved through the use of associative processors operating on feature space data), and the analysis of multiple objects in high noise (which is achieved using associative processing of the output from symbolic correlators). We detail and provide initial demonstrations of the use of associative processors operating on iconic, feature space and symbolic data, as well as adaptive associative processors.

  4. Enabling Future Robotic Missions with Multicore Processors

    NASA Technical Reports Server (NTRS)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  5. Ultra-Reliable Digital Avionics (URDA) processor

    NASA Astrophysics Data System (ADS)

    Branstetter, Reagan; Ruszczyk, William; Miville, Frank

    1994-10-01

    Texas Instruments Incorporated (TI) developed the URDA processor design under contract with the U.S. Air Force Wright Laboratory and the U.S. Army Night Vision and Electro-Sensors Directorate. TI's approach couples advanced packaging solutions with advanced integrated circuit (IC) technology to provide a high-performance (200 MIPS/800 MFLOPS) modular avionics processor module for a wide range of avionics applications. TI's processor design integrates two Ada-programmable, URDA basic processor modules (BPMs) with a JIAWG-compatible PiBus and TMBus on a single F-22 common integrated processor-compatible form-factor SEM-E avionics card. A separate, high-speed (25-MWord/second 32-bit word) input/output bus is provided for sensor data. Each BPM provides a peak throughput of 100 MIPS scalar concurrent with 400-MFLOPS vector processing in a removable multichip module (MCM) mounted to a liquid-flowthrough (LFT) core and interfacing to a processor interface module printed wiring board (PWB). Commercial RISC technology coupled with TI's advanced bipolar complementary metal oxide semiconductor (BiCMOS) application specific integrated circuit (ASIC) and silicon-on-silicon packaging technologies are used to achieve the high performance in a miniaturized package. A MIPS R4000-family reduced instruction set computer (RISC) processor and a TI 100-MHz BiCMOS vector coprocessor (VCP) ASIC provide, respectively, the 100 MIPS of scalar processor throughput and 400 MFLOPS of vector processing throughput for each BPM. The TI Aladdin ASIC chipset was developed on the TI Aladdin Program under contract with the U.S. Army Communications and Electronics Command and was sponsored by the Advanced Research Projects Agency with technical direction from the U.S. Army Night Vision and Electro-Sensors Directorate.

  6. MAP3D: a media processor approach for high-end 3D graphics

    NASA Astrophysics Data System (ADS)

    Darsa, Lucia; Stadnicki, Steven; Basoglu, Chris

    1999-12-01

    Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with a high performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline, allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to successfully find and exploit any parallelism in a programmer's code, thus decreasing the time to market of a given application. The ability to run an operating system makes it possible to run concurrent applications in the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved, using real-time decoded imagery for texturing 3D objects, for instance. This novel architecture enables an affordable, integrated solution for high performance 3D graphics.

  7. A novel VLSI processor architecture for supercomputing arrays

    NASA Technical Reports Server (NTRS)

    Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.

    1993-01-01

    Design of the processor element for general purpose massively parallel supercomputing arrays is highly complex and cost ineffective. To overcome this, the architecture and organization of the functional units of the processor element should be such as to suit the diverse computational structures and simplify mapping of complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different classes of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP, IP, OP, CM, R, SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units are such as to satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process have led to a very regular and simpler design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high performance mask programmable PAcube arrays.

  8. System Level RBDO for Military Ground Vehicles using High Performance Computing

    DTIC Science & Technology

    2008-01-01

    platform. Only the analyses that required more than 24 processors were conducted on the Onyx 350 due to the limited number of processors on the...optimization constraints varied. The queues set the number of processors and number of finite element code licenses available to the analyses. SGI Onyx 3900 (UNIX): 24 MIPS R16000 processors, 4 IR2 graphics pipes, 4 IR3 graphics pipes, 24 GB memory, 36 GB local disk space. SGI Onyx 350 (UNIX): 32 MIPS

  9. Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

    NASA Technical Reports Server (NTRS)

    Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

    2003-01-01

    Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and space-filling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, the 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.

  10. [Improving speech comprehension using a new cochlear implant speech processor].

    PubMed

    Müller-Deile, J; Kortmann, T; Hoppe, U; Hessel, H; Morsnowski, A

    2009-06-01

    The aim of this multicenter clinical field study was to assess the benefits of the new Freedom 24 sound processor for cochlear implant (CI) users implanted with the Nucleus 24 cochlear implant system. The study included 48 postlingually profoundly deaf experienced CI users who demonstrated speech comprehension performance with their current speech processor on the Oldenburg sentence test (OLSA) in quiet conditions of at least 80% correct scores and who were able to perform adaptive speech threshold testing using the OLSA in noisy conditions. Following baseline measures of speech comprehension performance with their current speech processor, subjects were upgraded to the Freedom 24 speech processor. After a take-home trial period of at least 2 weeks, subject performance was evaluated by measuring the speech reception threshold with the Freiburg multisyllabic word test and speech intelligibility with the Freiburg monosyllabic word test at 50 dB and 70 dB in the sound field. The results demonstrated highly significant benefits for speech comprehension with the new speech processor. Significant benefits for speech comprehension were also demonstrated with the new speech processor when tested in competing background noise. In contrast, use of the Abbreviated Profile of Hearing Aid Benefit (APHAB) did not prove to be a suitably sensitive assessment tool for comparative subjective self-assessment of hearing benefits with each processor. Use of the preprocessing algorithm known as adaptive dynamic range optimization (ADRO) in the Freedom 24 led to additional improvements over the standard upgrade map for speech comprehension in quiet and showed equivalent performance in noise. Through use of the preprocessing beam-forming algorithm BEAM, subjects demonstrated a highly significant improved signal-to-noise ratio for speech comprehension thresholds (i.e., signal-to-noise ratio for 50% speech comprehension scores) when tested with an adaptive procedure using the Oldenburg sentences in the clinical setting S(0)N(CI), with speech signal at 0 degrees and noise lateral to the CI at 90 degrees. With the convincing findings from our evaluations of this multicenter study cohort, a trial with the Freedom 24 sound processor for all suitable CI users is recommended. For evaluating the benefits of a new processor, the comparative assessment paradigm used in our study design would be considered ideal for use with individual patients.

  11. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125 MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which at this clock rate yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
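
    The matched-filter-by-FFT computation mentioned for the EnLight evaluation can be illustrated with a small NumPy sketch; the waveform and echo below are synthetic stand-ins, not sonar data.

    ```python
    import numpy as np

    def matched_filter(received, template):
        """Correlate via the frequency domain: IFFT(FFT(x) * conj(FFT(h)))."""
        n = len(received) + len(template) - 1
        R = np.fft.fft(received, n)
        H = np.fft.fft(template, n)
        return np.fft.ifft(R * np.conj(H)).real

    rng = np.random.default_rng(0)
    chirp = np.sin(np.linspace(0, 40 * np.pi, 256) ** 1.1)  # toy waveform
    echo = np.concatenate([np.zeros(1000), chirp, np.zeros(500)])
    echo += 0.5 * rng.standard_normal(echo.size)            # additive noise
    print("peak at lag", np.argmax(matched_filter(echo, chirp)))  # ~1000
    ```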

  12. First Results of an “Artificial Retina” Processor Prototype

    DOE PAGES

    Cenci, Riccardo; Bedeschi, Franco; Marino, Pietro; ...

    2016-11-15

    We report on the performance of a specialized processor capable of reconstructing charged particle tracks in a realistic LHC silicon tracker detector, at the same speed as the readout and with sub-microsecond latency. The processor is based on an innovative pattern-recognition algorithm, called the “artificial retina algorithm”, inspired by the vision system of mammals. A prototype of the processor has been designed, simulated, and implemented on Tel62 boards equipped with high-bandwidth Altera Stratix III FPGA devices. The prototype is also the first step towards a real-time track reconstruction device aimed at processing complex events of high-luminosity LHC experiments at a 40 MHz crossing rate.

  13. First Results of an “Artificial Retina” Processor Prototype

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cenci, Riccardo; Bedeschi, Franco; Marino, Pietro

    We report on the performance of a specialized processor capable of reconstructing charged particle tracks in a realistic LHC silicon tracker detector, at the same speed as the readout and with sub-microsecond latency. The processor is based on an innovative pattern-recognition algorithm, called the “artificial retina algorithm”, inspired by the vision system of mammals. A prototype of the processor has been designed, simulated, and implemented on Tel62 boards equipped with high-bandwidth Altera Stratix III FPGA devices. The prototype is also the first step towards a real-time track reconstruction device aimed at processing complex events of high-luminosity LHC experiments at a 40 MHz crossing rate.

  14. A fully reconfigurable photonic integrated signal processor

    NASA Astrophysics Data System (ADS)

    Liu, Weilin; Li, Ming; Guzzon, Robert S.; Norberg, Erik J.; Parker, John S.; Lu, Mingzhi; Coldren, Larry A.; Yao, Jianping

    2016-03-01

    Photonic signal processing has been considered a solution to overcome the inherent electronic speed limitations. Over the past few years, an impressive range of photonic integrated signal processors have been proposed, but they usually offer limited reconfigurability, a feature highly needed for the implementation of large-scale general-purpose photonic signal processors. Here, we report and experimentally demonstrate a fully reconfigurable photonic integrated signal processor based on an InP-InGaAsP material system. The proposed photonic signal processor is capable of performing reconfigurable signal processing functions including temporal integration, temporal differentiation and Hilbert transformation. The reconfigurability is achieved by controlling the injection currents to the active components of the signal processor. Our demonstration suggests great potential for chip-scale fully programmable all-optical signal processing.

  15. Dense and Sparse Matrix Operations on the Cell Processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel W.; Shalf, John; Oliker, Leonid

    2005-05-01

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.
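
    The kind of analytic prediction the authors describe can be caricatured with a roofline-style bound, as in the toy model below; the peak compute and bandwidth figures are assumptions for illustration, not the paper's measured values.

    ```python
    def predicted_time(flops, bytes_moved, peak_flops, peak_bw):
        """Time bounded by whichever is slower: compute or memory traffic."""
        return max(flops / peak_flops, bytes_moved / peak_bw)

    # Dense matrix multiply, n = 1024, double precision (assumed peaks)
    n = 1024
    t = predicted_time(flops=2 * n**3,
                       bytes_moved=3 * 8 * n**2,   # read A and B, write C
                       peak_flops=200e9, peak_bw=25e9)
    print(f"model time: {t * 1e3:.2f} ms")
    ```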

  16. Interactive high-resolution isosurface ray casting on multicore processors.

    PubMed

    Wang, Qin; JaJa, Joseph

    2008-01-01

    We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each processor being a quad-core 1.86-GHz Intel Xeon, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve interactive isosurface rendering on a 1024x1024 screen for all the datasets tested, up to the maximum size of the main memory of our platform.
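
    The dynamic allocation strategy described above can be shown schematically: threads pull small groups of pixel tiles from a shared queue instead of owning a fixed screen partition. The Python below is a structural sketch only; the render function is a stub, and the GIL makes this a model rather than a fast renderer.

    ```python
    import queue
    import threading

    def render_tile(tile):            # stand-in for isosurface ray casting
        x0, y0 = tile
        return sum((x0 + i) ^ (y0 + j) for i in range(16) for j in range(16))

    tiles = queue.Queue()
    for y in range(0, 1024, 16):
        for x in range(0, 1024, 16):
            tiles.put((x, y))         # one work unit per 16x16 tile

    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                tile = tiles.get_nowait()   # dynamic load balancing
            except queue.Empty:
                return
            r = render_tile(tile)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(len(results), "tiles rendered")
    ```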

  17. JPRS Report, Science & Technology, China, High-Performance Computer Systems

    DTIC Science & Technology

    1992-10-28

    microprocessor array... The microprocessor array in the AP85 system is composed of 16 completely identical array element microprocessors. Each array element...microprocessors and capable of host machine reading and writing. The memory capacity of the array element microprocessors as a whole can be expanded...transmission functions to carry out data transmission from array element microprocessor to array element microprocessor, from array element

  18. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G.; Salapura, Valentina

    2012-07-24

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices, each processor device generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters, each for counting signals representing occurrences of events from one or more of the plurality of processor units in the multiprocessor system; and a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
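
    A behavioral model of the claimed structure, a small shared pool of counters each programmable to select event signals from any device, might look like the sketch below; the class, method names, and event vocabulary are invented for illustration and are not from the patent.

    ```python
    class SharedPMU:
        """Toy model of one counter pool shared by all processor devices."""

        def __init__(self, n_counters):
            self.counts = [0] * n_counters
            self.select = [None] * n_counters   # (source_id, event) routing

        def program(self, counter, source_id, event):
            """Route one device's event signal to one shared counter."""
            self.select[counter] = (source_id, event)

        def on_event(self, source_id, event):
            """Called when any processor (or non-processor) device signals."""
            for i, sel in enumerate(self.select):
                if sel == (source_id, event):
                    self.counts[i] += 1

    pmu = SharedPMU(n_counters=4)
    pmu.program(0, source_id=2, event="L2_MISS")   # watch core 2's misses
    pmu.on_event(2, "L2_MISS")
    print(pmu.counts)                               # [1, 0, 0, 0]
    ```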

  19. Multicore Programming Challenges

    NASA Astrophysics Data System (ADS)

    Perrone, Michael

    The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.

  20. Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications

    NASA Astrophysics Data System (ADS)

    Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei

    2007-04-01

    In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. Using the PE rings efficiently and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core in multimedia system-on-chip applications.
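
    As background for the computation a motion-estimation PE array accelerates, here is a reference NumPy sketch of full-search block matching by sum of absolute differences (SAD); block size and search range are illustrative parameters, not the paper's.

    ```python
    import numpy as np

    def best_motion_vector(cur, ref, bx, by, B=16, search=8):
        """Return (dy, dx) minimizing SAD for the BxB block at (by, bx)."""
        block = cur[by:by+B, bx:bx+B].astype(np.int32)
        best, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                    continue                     # candidate outside frame
                sad = np.abs(block - ref[y:y+B, x:x+B].astype(np.int32)).sum()
                if best is None or sad < best:
                    best, best_mv = sad, (dy, dx)
        return best_mv

    rng = np.random.default_rng(1)
    ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)
    cur = np.roll(ref, (3, -2), axis=(0, 1))       # cur[y, x] = ref[y-3, x+2]
    print(best_motion_vector(cur, ref, bx=24, by=24))   # expect (-3, 2)
    ```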

  1. Next Generation Space Telescope Integrated Science Module Data System

    NASA Technical Reports Server (NTRS)

    Schnurr, Richard G.; Greenhouse, Matthew A.; Jurotich, Matthew M.; Whitley, Raymond; Kalinowski, Keith J.; Love, Bruce W.; Travis, Jeffrey W.; Long, Knox S.

    1999-01-01

    The data system for the Next Generation Space Telescope (NGST) Integrated Science Module (ISIM) is the primary data interface between the spacecraft, telescope, and science instrument systems. This poster includes block diagrams of the ISIM data system and its components derived during the pre-phase A Yardstick feasibility study. The poster details the hardware and software components used to acquire and process science data for the Yardstick instrument complement, and depicts the baseline external interfaces to science instruments and other systems. This baseline data system is a fully redundant, high performance computing system. Each redundant computer contains three 150 MHz PowerPC processors. All processors execute a commercially available real-time multi-tasking operating system supporting preemptive multi-tasking, file management, and network interfaces. These six processors in the system are networked together. The spacecraft interface baseline is an extension of the network which links the six processors. The final selection of processor buses, processor chips, network interfaces, and high-speed data interfaces will be made during mid 2002.

  2. The development of a general purpose ARM-based processing unit for the ATLAS TileCal sROD

    NASA Astrophysics Data System (ADS)

    Cox, M. A.; Reed, R.; Mellado, B.

    2015-01-01

    After the Phase-II upgrades in 2022, the data output from the LHC ATLAS Tile Calorimeter will increase significantly. ARM processors are common in mobile devices due to their low cost, low energy consumption and high performance. It is proposed that a cost-effective, high data throughput Processing Unit (PU) can be developed by using several consumer ARM processors in a cluster configuration to allow aggregated processing performance and data throughput while maintaining minimal software design difficulty for the end-user. This PU could be used for a variety of high-level functions on the high-throughput raw data, such as spectral analysis and histograms, to detect possible issues in the detector at a low level. High-throughput I/O interfaces are not typical in consumer ARM systems on chip, but high data throughput capabilities are feasible via the novel use of PCI-Express as the I/O interface to the ARM processors. An overview of the PU is given and the results of performance and throughput testing of four different ARM Cortex systems on chip are presented.

  3. An evaluation of MPI message rate on hybrid-core processors

    DOE PAGES

    Barrett, Brian W.; Brightwell, Ron; Grant, Ryan; ...

    2014-11-01

    Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

  4. Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

    NASA Technical Reports Server (NTRS)

    Fatoohi, Rod; Saini, Subhash; Ciotti, Robert

    2006-01-01

    We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify limiting factors and bottlenecks in the interconnects of these systems, as well as to compare the interconnects. We measured network bandwidth using different numbers of communicating processors and different communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray X1 shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our results show the impact of network bandwidth and topology on the overall performance of each interconnect.

  5. Programmable optical processor chips: toward photonic RF filters with DSP-level flexibility and MHz-band selectivity

    NASA Astrophysics Data System (ADS)

    Xie, Yiwei; Geng, Zihan; Zhuang, Leimeng; Burla, Maurizio; Taddei, Caterina; Hoekman, Marcel; Leinse, Arne; Roeloffzen, Chris G. H.; Boller, Klaus-J.; Lowery, Arthur J.

    2017-12-01

    Integrated optical signal processors have been identified as a powerful engine for optical processing of microwave signals. They enable wideband and stable signal processing operations on miniaturized chips with ultimate control precision. As a promising application, such processors enable photonic implementations of reconfigurable radio frequency (RF) filters with wide design flexibility, large bandwidth, and high frequency selectivity. This is a key technology for photonic-assisted RF front ends that opens a path to overcoming the bandwidth limitation of current digital electronics. Here, the recent progress of integrated optical signal processors for implementing such RF filters is reviewed. We highlight the use of a low-loss, high-index-contrast stoichiometric silicon nitride waveguide, which promises to serve as a practical material platform for realizing high-performance optical signal processors and points toward photonic RF filters with digital signal processing (DSP)-level flexibility, hundreds-of-GHz bandwidth, MHz-band frequency selectivity, and full system integration on a chip scale.

  6. APRON: A Cellular Processor Array Simulation and Hardware Design Tool

    NASA Astrophysics Data System (ADS)

    Barr, David R. W.; Dudek, Piotr

    2009-12-01

    We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.

  7. MILC Code Performance on High End CPU and GPU Supercomputer Clusters

    NASA Astrophysics Data System (ADS)

    DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug

    2018-03-01

    With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.

  8. Integration, Development and Performance of the 500 TFLOPS Heterogeneous Cluster (Condor)

    DTIC Science & Technology

    2012-08-01

    [Fragmentary record] Conference paper (post-print), June 2010 - May 2013. The Condor cluster combines GPGPU and streaming processing; its PlayStation 3 nodes use the IBM Cell BE processor, which adopts a multi-processor, single-instruction-multiple-data (SIMD) design. Cited works include "PlayStation 3 for High Performance Cluster Computing," LAPACK Working Note 185, 2007, and Feng, W., X. Feng, and R. Ge, "Green Supercomputing Comes of Age."

  9. Performance analysis of the GR712RC dual-core LEON3FT SPARC V8 processor in an asymmetric multi-processing environment

    NASA Astrophysics Data System (ADS)

    Giusi, Giovanni; Liu, Scige J.; Galli, Emanuele; Di Giorgio, Anna M.; Farina, Maria; Vertolli, Nello; Di Lellis, Andrea M.

    2016-07-01

    In this paper we present the results of a series of performance tests carried out on a prototype board mounting the Cobham Gaisler GR712RC dual-core LEON3FT processor. The aim was to characterize the performance of the dual-core processor when used to execute a highly demanding lossless compression task acting on data segments continuously copied from static memory to the processor RAM. Compression was selected as the evaluation activity because it allows a comparison with tests previously executed on the Cobham/Aeroflex Gaisler UT699 LEON3FT SPARC™ V8. The tests showed an improvement factor of 1.6 with respect to the previous results, which could be increased further by adopting a faster onboard clock, and provided indications on the best size of the data chunks to be used in the compression activity.

  10. A microprocessor based high speed packet switch for satellite communications

    NASA Technical Reports Server (NTRS)

    Arozullah, M.; Crist, S. C.

    1980-01-01

    The architectures of a single-processor, a three-processor, and a multiple-processor system are described. The hardware circuits and software routines required to implement the three- and multiple-processor designs are presented. A bit-slice microprocessor was designed and microprogrammed. Maximum throughput was calculated for all three designs. Queueing-theoretic models of the three designs were developed and used to obtain analytical expressions for the average waiting times, overall average response times, and average queue sizes. From these expressions, graphs were obtained showing the effect of a number of design parameters on system performance.

  11. Comparing an FPGA to a Cell for an Image Processing Application

    NASA Astrophysics Data System (ADS)

    Rakvic, Ryan N.; Ngo, Hau; Broussard, Randy P.; Ives, Robert W.

    2010-12-01

    Modern advancements in configurable hardware, most notably Field-Programmable Gate Arrays (FPGAs), have provided an exciting opportunity to exploit the parallel nature of modern image processing algorithms. On the other hand, PlayStation 3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms with high performance. In this research project, our aim is to study the differences in performance of a modern image processing algorithm on these two hardware platforms. In particular, iris recognition systems have recently become an attractive identification method because of their extremely high accuracy. Iris matching, a repeatedly executed portion of a modern iris recognition algorithm, is parallelized on an FPGA system and on a Cell processor. We demonstrate a 2.5 times speedup of the parallelized algorithm on the FPGA system when compared to a Cell processor-based version.

  12. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G; Salapura, Valentina

    2014-12-02

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices, each generating signals that represent occurrences of events in that processor device, and a single shared counter resource for performance monitoring. The performance monitoring unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.

  13. Acousto-optic time- and space-integrating spotlight-mode SAR processor

    NASA Astrophysics Data System (ADS)

    Haney, Michael W.; Levy, James J.; Michael, Robert R., Jr.

    1993-09-01

    The technical approach and recent experimental results for the acousto-optic time- and space-integrating real-time SAR image formation processor program are reported. The concept overcomes the size and power consumption limitations of electronic approaches by using compact, rugged, and low-power analog optical signal processing techniques for the most computationally taxing portions of the SAR imaging problem. Flexibility and performance are maintained by the use of digital electronics for the critical low-complexity filter generation and output image processing functions. The results include a demonstration of the processor's ability to perform high-resolution spotlight-mode SAR imaging by simultaneously compensating for range migration and range/azimuth coupling in the analog optical domain, thereby avoiding the highly power-consuming digital interpolation or reformatting operation usually required in all-electronic approaches.

  14. Cedar-a large scale multiprocessor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gajski, D.; Kuck, D.; Lawrie, D.

    1983-01-01

    This paper presents an overview of Cedar, a large-scale multiprocessor being designed at the University of Illinois. The machine is designed to accommodate several thousand high-performance processors, which can work together on a single job or be partitioned into groups of one or more processors, with each group working on a separate job. Various aspects of the machine are described, including the control methodology, communication network, optimizing compiler, and plans for construction. 13 references.

  15. RISC Processors and High Performance Computing

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    This tutorial will discuss the top five RISC microprocessors and the parallel systems in which they are used. It will provide a unique cross-machine comparison not available elsewhere. The effective performance of these processors will be compared by citing standard benchmarks in the context of real applications. The latest NAS Parallel Benchmarks, both absolute performance and performance per dollar, will be listed. The next generation of the NPB will be described. The tutorial will conclude with a discussion of future directions in the field. Technology Transfer Considerations: All of these computer systems are commercially available internationally. Information about these processors is available in the public domain, mostly from the vendors themselves. The NAS Parallel Benchmarks and their results have been previously approved numerous times for public release, beginning back in 1991.

  16. Electrical Prototype Power Processor for the 30-cm Mercury electric propulsion engine

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Frye, R. J.

    1978-01-01

    An Electrical Prototype Power Processor has been designed to the latest electrical and performance requirements for a flight-type 30-cm ion engine and includes all the necessary power, command, telemetry, and control interfaces for a typical electric propulsion subsystem. The power processor was configured into seven separate mechanical modules that allow subassembly fabrication, test, and integration into a complete power processor unit assembly. The conceptual mechanical packaging of the electrical prototype power processor unit demonstrated the relative location of power, high-voltage, and control electronic components to minimize electrical interactions and to provide adequate thermal control in a vacuum environment. Thermal control was accomplished with a heat pipe simulator attached to the base of the modules.

  17. Hyperswitch Communication Network Computer

    NASA Technical Reports Server (NTRS)

    Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.

    1993-01-01

    The Hyperswitch Communications Network (HCN) computer is a prototype multiple-processor computer being developed. It incorporates an improved version of the hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905) and is designed to support high-level software and expansion of itself. The HCN computer is a message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers with respect to price/performance ratio, reliability, availability, and manufacturing. The design of the HCN operating-system software provides a flexible computing environment accommodating both parallel and distributed processing. It also achieves a balance among the following competing factors: performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.

  18. Design and implementation of a high performance network security processor

    NASA Astrophysics Data System (ADS)

    Wang, Haixin; Bai, Guoqiang; Chen, Hongyi

    2010-03-01

    The last few years have seen much significant progress in the field of application-specific processors. One example is network security processors (NSPs), which perform various cryptographic operations specified by network security protocols and help to offload the computation-intensive burden from network processors (NPs). This article presents a high-performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogeneous parallel crypto engine arrays lead to a Gbps-rate NSP that is programmable with domain-specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000-based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.

  19. Implementation of kernels on the Maestro processor

    NASA Astrophysics Data System (ADS)

    Suh, Jinwoo; Kang, D. I. D.; Crago, S. P.

    Currently, most microprocessors use multiple cores to increase performance while limiting power usage. Some processors use not just a few cores, but tens of cores or even 100 cores. One such many-core microprocessor is the Maestro processor, which is based on Tilera's TILE64 processor. The Maestro chip is a 49-core, general-purpose, radiation-hardened processor designed for space applications. The Maestro processor, unlike the TILE64, has a floating point unit (FPU) in each core for improved floating point performance. The Maestro processor runs at a 342 MHz clock frequency. On the Maestro processor, we implemented several widely used kernels: matrix multiplication, vector add, FIR filter, and FFT. We measured and analyzed the performance of these kernels. The achieved performance was up to 5.7 GFLOPS, and the speedup compared to a single tile was up to 49 using 49 tiles.

  20. High-performance computing on GPUs for resistivity logging of oil and gas wells

    NASA Astrophysics Data System (ADS)

    Glinskikh, V.; Dudaev, A.; Nechaev, O.; Surodina, I.

    2017-10-01

    We developed and implemented in software an algorithm for high-performance simulation of electrical logs from oil and gas wells using high-performance heterogeneous computing. The numerical solution of the 2D forward problem is based on the finite-element method and the Cholesky decomposition for solving a system of linear algebraic equations (SLAE). Software implementations of the algorithm were made using NVIDIA CUDA technology and computing libraries, allowing us to perform the decomposition of the SLAE and find its solution on the central processing unit (CPU) and on the graphics processing unit (GPU). The calculation time is analyzed as a function of the matrix size and the number of its non-zero elements. We estimated the computing speed on the CPU and GPU, including high-performance heterogeneous CPU-GPU computing. Using the developed algorithm, we simulated resistivity data in realistic models.
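
    A dense stand-in for the Cholesky-based SLAE solve described above, in NumPy; the paper's actual solver is sparse and CUDA-based, so this is only a CPU sketch whose calls have direct GPU counterparts in libraries such as CuPy:

        import numpy as np

        n = 2000
        rng = np.random.default_rng(0)
        M = rng.standard_normal((n, n))
        A = M @ M.T + n * np.eye(n)       # symmetric positive definite, like a FEM stiffness matrix
        b = rng.standard_normal(n)

        L = np.linalg.cholesky(A)         # A = L L^T
        y = np.linalg.solve(L, b)         # solve L y = b
        x = np.linalg.solve(L.T, y)       # solve L^T x = y
        print(np.linalg.norm(A @ x - b))  # residual check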

  1. Assessment of mammographic film processor performance in a hospital and mobile screening unit.

    PubMed

    Murray, J G; Dowsett, D J; Laird, O; Ennis, J T

    1992-12-01

    In contrast to the majority of mammographic breast screening programmes, film processing at this centre occurs on site in both hospital and mobile trailer units. Initial (1989) quality control (QC) sensitometric tests revealed a large variation in film processor performance in the mobile unit. The clinical significance of these variations was assessed and acceptance limits for processor performance determined. Abnormal mammograms were used as reference material and copied using high-definition 35 mm film over a range of exposure settings. The copies were then matched with QC film density variation from the mobile unit. All films were subsequently ranked for spatial and contrast resolution. Optimal values for a processing time of 2 min (equivalent to a film transit time of 3 min and a developer time of 46 s) and a temperature of 36 degrees C were obtained. The widespread anomaly of reporting film transit time as processing time is highlighted. Use of mammogram copies as a means of measuring the influence of film processor variation is advocated. Careful monitoring of the mobile unit film processor performance has produced stable quality comparable with the hospital-based unit. The advantages of on-site film processing are outlined. The addition of a sensitometric step wedge to all mammography film stock as a means of assessing image quality is recommended.

  2. Development of compact fuel processor for 2 kW class residential PEMFCs

    NASA Astrophysics Data System (ADS)

    Seo, Yu Taek; Seo, Dong Joo; Jeong, Jin Hyeok; Yoon, Wang Lai

    The Korea Institute of Energy Research (KIER) has been developing a novel fuel processing system to provide hydrogen-rich gas to residential polymer electrolyte membrane fuel cell (PEMFC) cogeneration systems. For an effective design of a compact hydrogen production system, the unit processes of steam reforming, high- and low-temperature water gas shift, steam generation, and internal heat exchange are thermally and physically integrated into a packaged hardware system. Several prototypes are under development; the prototype I fuel processor showed a thermal efficiency of 73% on an HHV basis with a methane conversion of 81%. The recently tested prototype II has shown improved performance, with a thermal efficiency of 76% and a methane conversion of 83%. In both prototypes, two-stage PrOx reactors reduce the CO concentration to less than 10 ppm, which is the prerequisite CO limit of the product gas for the PEMFC stack. After confirming the initial performance of the prototype I fuel processor, it was coupled with a PEMFC single cell to test durability, demonstrating that the fuel processor operated successfully for 3 days without any failure of fuel cell voltage. The prototype II fuel processor also showed stable performance during the durability test.

  3. Design Approach and Implementation of Application Specific Instruction Set Processor for SHA-3 BLAKE Algorithm

    NASA Astrophysics Data System (ADS)

    Zhang, Yuli; Han, Jun; Weng, Xinqian; He, Zhongzhu; Zeng, Xiaoyang

    This paper presents an Application-Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family, built by instruction set extension (ISE) of a RISC (reduced instruction set computer) processor. Through a design space exploration of this ASIP to increase performance and reduce area cost, we accomplish an efficient hardware and software implementation of the BLAKE algorithm. The special instructions and their well-matched hardware function unit speed up the calculation of the key section of the algorithm, namely the G-functions. Also, relaxing the time constraint of the special function unit decreases its hardware cost while keeping the high data throughput of the processor. Evaluation results reveal that the ASIP achieves 335 Mbps for BLAKE-256 and 176 Mbps for BLAKE-512. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycles per byte. In fact, both the high throughput and the low hardware cost achieved by this programmable processor are comparable to those of ASIC implementations.
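
    For reference, the G-function targeted by such special instructions reduces to 32-bit adds, XORs, and fixed rotations; a Python sketch follows, with the message schedule simplified: mx and my here stand for the already-permuted, constant-XORed message words, so the sigma permutation itself is omitted:

        MASK = 0xFFFFFFFF  # 32-bit words

        def rotr(x, n):
            return ((x >> n) | (x << (32 - n))) & MASK

        def g(a, b, c, d, mx, my):
            # One BLAKE-256 G evaluation on four state words
            a = (a + b + mx) & MASK
            d = rotr(d ^ a, 16)
            c = (c + d) & MASK
            b = rotr(b ^ c, 12)
            a = (a + b + my) & MASK
            d = rotr(d ^ a, 8)
            c = (c + d) & MASK
            b = rotr(b ^ c, 7)
            return a, b, c, d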

  4. Automatic differentiation for design sensitivity analysis of structural systems using multiple processors

    NASA Technical Reports Server (NTRS)

    Nguyen, Duc T.; Storaasli, Olaf O.; Qin, Jiangning; Qamar, Ramzi

    1994-01-01

    An automatic differentiation tool (ADIFOR) is incorporated into a finite element based structural analysis program for shape and non-shape design sensitivity analysis of structural systems. The entire analysis and sensitivity procedures are parallelized and vectorized for high performance computation. Small scale examples to verify the accuracy of the proposed program and a medium scale example to demonstrate the parallel vector performance on multiple CRAY C90 processors are included.

  5. Extended performance electric propulsion power processor design study. Volume 2: Technical summary

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.; Schoenfeld, A. D.

    1977-01-01

    Electric propulsion power processor technology has progressed during the past decade to the point that it is considered ready for application. Several power processor design concepts were evaluated and compared. Emphasis was placed on a 30-cm ion thruster power processor with a beam supply power rating of 2.2 kW to 10 kW for the main propulsion power stage. Extensions in power processor performance were defined and designed in sufficient detail to determine efficiency, component weight, part count, reliability, and thermal control. A detailed design was performed of a microprocessor serving as the thyristor power processor controller. A reliability analysis was performed to evaluate the effect of the control electronics redesign. Preliminary electrical design, mechanical design, and thermal analysis were performed on a 6 kW power transformer for the beam supply. Bi-Mod mechanical, structural, and thermal control configurations were evaluated for the power processor, and preliminary estimates of mechanical weight were determined.

  6. Matrix-vector multiplication using digital partitioning for more accurate optical computing

    NASA Technical Reports Server (NTRS)

    Gary, C. K.

    1992-01-01

    Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers, as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and speed required for an equivalent throughput, as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient algorithms when coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor.
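
    A toy illustration of the digital partitioning idea, assuming base-16 digits and vector-only partitioning (the matrix could be partitioned the same way): each low-precision product is what the analog optical stage would compute, and the weighted sum is the digital recombination:

        import numpy as np

        B, ndig = 16, 4                                    # base and digit count (illustrative)
        rng = np.random.default_rng(1)
        A = rng.integers(0, 256, (4, 4))
        v = rng.integers(0, B**ndig, 4)

        digits = [(v // B**k) % B for k in range(ndig)]    # low-precision digit slices of v
        partials = [A @ d for d in digits]                 # products an analog stage could do
        y = sum(p * B**k for k, p in enumerate(partials))  # digital recombination
        assert np.array_equal(y, A @ v)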

  7. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    NASA Astrophysics Data System (ADS)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the evolution of international wireless communication standards, the computational requirements for baseband signal processors are increasing. Time-to-market pressure makes it impossible to completely redesign new processors for each evolving standard. Due to their high flexibility and low power, software-defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA approaches. In addition, computation-intensive functions process large amounts of data in parallel, which fosters the development of single-instruction multiple-data (SIMD) architectures for SDR platforms. A new way must therefore be found to prototype SDR processors efficiently. In this paper we present a bit- and cycle-accurate model of a programmable SIMD SDR processor in the machine description language LISA. LISA is a language for instruction set architectures that enables rapid modeling at the architectural level. To evaluate the proposed processor, three common baseband functions, FFT, FIR digital filter, and matrix multiplication, were mapped onto the SDR platform. Analytical results showed that the SDR processor achieved a performance boost of up to 47.1% relative to the reference processor.

  8. Onboard Interferometric SAR Processor for the Ka-Band Radar Interferometer (KaRIn)

    NASA Technical Reports Server (NTRS)

    Esteban-Fernandez, Daniel; Rodriquez, Ernesto; Peral, Eva; Clark, Duane I.; Wu, Xiaoqing

    2011-01-01

    An interferometric synthetic aperture radar (SAR) onboard processor concept and algorithm has been developed for the Ka-band radar interferometer (KaRIn) instrument on the Surface Water and Ocean Topography (SWOT) mission. This is a mission-critical subsystem that will perform interferometric SAR processing and multi-look averaging over the oceans to decrease the data rate by three orders of magnitude and thereby enable the downlink of the radar data to the ground. The onboard processor performs demodulation, range compression, coregistration, and re-sampling, and forms nine azimuth-squinted beams. For each of them, an interferogram is generated, including common-band spectral filtering to improve correlation, followed by averaging to the final 1 km × 1 km ground resolution pixel. The onboard processor has been prototyped on a custom FPGA-based cPCI board, which will be part of the radar's digital subsystem. The level of complexity of this technology, dictated by the implementation of interferometric SAR processing at high resolution, the extremely tight accuracy requirements, and its implementation on FPGAs, is unprecedented at the time of this reporting for an onboard processor for flight applications.
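
    The two central steps named above, interferogram formation and multi-look averaging, in a NumPy sketch with illustrative shapes and look counts (the flight processor's filtering, coregistration, and beam formation are omitted):

        import numpy as np

        def interferogram(s1, s2):
            return s1 * np.conj(s2)            # complex product; the phase difference is the signal

        def multilook(ifg, la, lr):
            az = ifg.shape[0] // la * la       # trim to whole looks
            rg = ifg.shape[1] // lr * lr
            v = ifg[:az, :rg].reshape(az // la, la, rg // lr, lr)
            return v.mean(axis=(1, 3))         # averaging cuts the data rate and the noise

        rng = np.random.default_rng(2)
        s1 = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
        s2 = s1 * np.exp(-1j * 0.1)            # second channel with a small phase offset
        phase = np.angle(multilook(interferogram(s1, s2), 4, 4))  # ~0.1 rad everywhere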

  9. Ring-array processor distribution topology for optical interconnects

    NASA Technical Reports Server (NTRS)

    Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

    1992-01-01

    The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

  10. Advanced Avionics and Processor Systems for a Flexible Space Exploration Architecture

    NASA Technical Reports Server (NTRS)

    Keys, Andrew S.; Adams, James H.; Smith, Leigh M.; Johnson, Michael A.; Cressler, John D.

    2010-01-01

    The Advanced Avionics and Processor Systems (AAPS) project, formerly known as the Radiation Hardened Electronics for Space Environments (RHESE) project, endeavors to develop the advanced avionic and processor technologies anticipated to be used by NASA's currently evolving space exploration architectures. The AAPS project is a part of the Exploration Technology Development Program, which funds an entire suite of technologies aimed at enabling NASA's ability to explore beyond low Earth orbit. NASA's Marshall Space Flight Center (MSFC) manages the AAPS project. AAPS uses a broad-scoped approach to developing avionic and processor systems. Investment areas include advanced electronic designs and technologies capable of providing environmental hardness, reconfigurable computing techniques, software tools for radiation effects assessment, and radiation environment modeling tools. Near-term emphasis within the multiple AAPS tasks focuses on developing prototype components using semiconductor processes and materials (such as Silicon-Germanium (SiGe)) to enhance a device's tolerance to radiation events and low-temperature environments. As the SiGe technology will culminate in a delivered prototype this fiscal year, the project shifts its focus to developing low-power, high-efficiency total processor hardening techniques. In addition to processor development, the project endeavors to demonstrate techniques applicable to reconfigurable computing and partially reconfigurable Field Programmable Gate Arrays (FPGAs). This capability gives avionic architectures the ability to field FPGA-based, radiation-tolerant processor boards that can serve in multiple physical locations throughout the spacecraft and perform multiple functions during the course of the mission. The individual tasks that comprise AAPS are diverse, yet united in the common endeavor to develop electronics capable of operating within the harsh environment of space. Specifically, the AAPS tasks for the Federal fiscal year of 2010 are: Silicon-Germanium (SiGe) Integrated Electronics for Extreme Environments, Modeling of Radiation Effects on Electronics, Radiation Hardened High Performance Processors (HPP), and Reconfigurable Computing.

  11. Combining high performance simulation, data acquisition, and graphics display computers

    NASA Technical Reports Server (NTRS)

    Hickman, Robert J.

    1989-01-01

    Issues involved in the continuing development of an advanced simulation complex are discussed. This approach provides the capability to perform the majority of tests on advanced systems non-destructively. The controlled test environments can be replicated to examine the response of the systems under test to alternative treatments of the system control design, or to test the function and qualification of specific hardware. Field tests verify that the elements simulated in the laboratories are sufficient. The digital computer is hosted by a Digital Equipment Corp. MicroVAX computer, with an Aptec Computer Systems Model 24 I/O computer performing the communication function. An Applied Dynamics International AD100 performs the high-speed simulation computing, and an Evans and Sutherland PS350 performs on-line graphics display. A Scientific Computer Systems SCS40 acts as a high-performance FORTRAN program processor to support the complex by generating, from programs coded in FORTRAN, the numerous large files required for the real-time processing. Four programming languages are involved in the process: FORTRAN, ADSIM, ADRIO, and STAPLE. FORTRAN is employed on the MicroVAX host to initialize and terminate the simulation runs on the system. The generation of the data files on the SCS40 is also performed with FORTRAN programs. ADSIM and ADRIO are used to program the processing elements of the AD100 and its IOCP processor. STAPLE is used to program the Aptec DIP and DIA processors.

  12. Importance of balanced architectures in the design of high-performance imaging systems

    NASA Astrophysics Data System (ADS)

    Sgro, Joseph A.; Stanton, Paul C.

    1999-03-01

    Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high-performance computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high-performance general-purpose microprocessor technology have produced an abundance of low-cost components suitable for use in high-performance computing systems. A common pitfall in the design of high-performance imaging systems, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The characteristic symptom of this problem is the failure of system performance to scale as more processors are added. The problem is exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high-performance external memory interfaces makes it practical to design high-performance imaging systems with balanced computational and memory bandwidth. Real-world examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize the available memory bandwidth.
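
    A roofline-style back-of-the-envelope model of the balance problem: attainable throughput is capped by the minimum of aggregate peak compute and memory bandwidth times arithmetic intensity, so adding processors to a shared bus stops helping once the bandwidth ceiling is reached. The numbers below are made up for illustration:

        def attainable_gflops(n_procs, gflops_per_proc=1.0, bw_gbs=25.6, flops_per_byte=0.25):
            # roofline: the compute roof grows with processors, the bandwidth ceiling does not
            return min(n_procs * gflops_per_proc, bw_gbs * flops_per_byte)

        for p in (1, 2, 4, 8, 16):
            print(p, attainable_gflops(p))  # scales up to ~6 processors, then saturates at 6.4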

  13. Implementation of a cone-beam backprojection algorithm on the cell broadband engine processor

    NASA Astrophysics Data System (ADS)

    Bockenbach, Olivier; Knaup, Michael; Kachelrieß, Marc

    2007-03-01

    Tomographic image reconstruction is computationally very demanding. In all cases the backprojection represents the performance bottleneck due to the high operation count and the high demand placed on the memory subsystem. In the past, solving this problem has led to the implementation of specific architectures, connecting Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs) to memory through dedicated high-speed busses. More recently, there have also been attempts to use Graphics Processing Units (GPUs) to perform the backprojection step. Originally aimed at the gaming market, the Cell Broadband Engine (CBE) processor introduced by IBM, Toshiba, and Sony is often considered a multicomputer on a chip. Clocked at 3 GHz, the Cell allows for a theoretical performance of 192 GFlops and a peak data transfer rate over the internal bus of 200 GB/s. This performance indeed makes the Cell a very attractive architecture for implementing tomographic image reconstruction algorithms. In this study, we investigate the relative performance of a perspective backprojection algorithm when implemented on a standard PC and on the Cell processor. We compare these results to the performance achievable with FPGA-based boards and high-end GPUs. The cone-beam backprojection performance was assessed by backprojecting a full circle scan of 512 projections of 1024x1024 pixels into a volume of 512x512x512 voxels. This took 3.2 minutes on the PC (single CPU) and just 13.6 seconds on the Cell.
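
    To see why backprojection dominates, consider a naive parallel-beam 2D version, a simplification of the cone-beam case (which adds a third dimension and perspective weighting): every projection angle touches every pixel with an interpolated read:

        import numpy as np

        def backproject(sino, thetas, n):
            # naive unfiltered parallel-beam backprojection: O(len(thetas) * n^2) work
            xs = np.arange(n) - n / 2 + 0.5
            X, Y = np.meshgrid(xs, xs)
            t0 = np.arange(sino.shape[1]) - sino.shape[1] / 2 + 0.5
            img = np.zeros((n, n))
            for row, th in zip(sino, thetas):
                t = X * np.cos(th) + Y * np.sin(th)  # detector coordinate of every pixel
                img += np.interp(t, t0, row)         # one interpolated read per pixel
            return img * np.pi / len(thetas)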

  14. Right-Brain/Left-Brain Integrated Associative Processor Employing Convertible Multiple-Instruction-Stream Multiple-Data-Stream Elements

    NASA Astrophysics Data System (ADS)

    Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi

    2005-04-01

    A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to realize the MIMD element: calculation units switchable between associative and arithmetic operations, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set that minimizes the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to occupy 3 mm², showing the feasibility of an 8-parallel-MIMD-element associative processor in a single 5 mm × 5 mm chip.

  15. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  16. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  17. Special purpose parallel computer architecture for real-time control and simulation in robotic applications

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1993-01-01

    This is a real-time robotic controller and simulator (RRCS): a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each being a SIMD parallel processor capable of exploiting fine-grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors, and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between them. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.

  18. Thermal Hotspots in CPU Die and It's Future Architecture

    NASA Astrophysics Data System (ADS)

    Wang, Jian; Hu, Fu-Yuan

    Owing to increasing core frequency and chip integration within a limited die area, power densities in CPU chips have been rising rapidly. The high on-chip temperatures that result from these power densities threaten the processor's performance and the chip's reliability. This paper analyzes the thermal hotspots in the die and their properties. A new arrangement of function units in the die, a hot-units-distributed architecture, is suggested to cope with the problem of high power densities in future processor chips.

  19. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio [Los Alamos, NM]; Feng, Wu-chun [Los Alamos, NM]

    2006-01-31

    A computer-implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval, so that each processor is informed by all of the other processors of the number of incoming jobs it will receive in the subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
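
    A minimal mpi4py sketch of the strobe-interval exchange, assuming each rank merely tallies outgoing jobs per destination during the interval; a single all-to-all then informs every rank of its incoming load (the job counts here are random and purely illustrative):

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        P, rank = comm.Get_size(), comm.Get_rank()

        # jobs queued for each destination during this time interval (random stand-in)
        outgoing = np.random.default_rng(rank).integers(0, 5, P).astype('i')
        incoming = np.empty(P, dtype='i')
        comm.Alltoall(outgoing, incoming)  # the global exchange at the strobe
        print(rank, "will receive", int(incoming.sum()), "jobs next interval")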

  20. Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

    NASA Astrophysics Data System (ADS)

    Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

    2017-11-01

    The current trend in processor manufacturing focuses on multi-core architectures rather than increasing clock speed for performance improvement. Graphics processors have become commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created huge demand for data processing activities, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-architecture-based GPUs. This paper reviews the architectural aspects of multi-core and many-core processors and graphics processors. Different case studies are taken to compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA-API-based programming.

  1. Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi

    NASA Astrophysics Data System (ADS)

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad

    2015-05-01

    Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high-computing-density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and the Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance, and energy efficiency, and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  2. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

    2017-01-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  3. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    NASA Astrophysics Data System (ADS)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  4. An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

    NASA Astrophysics Data System (ADS)

    Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

    2003-08-01

    Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining high performance on large numbers of processors is non-trivial on the latest generation of parallel computers, which consist of nodes made up of shared-memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
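
    A slab-decomposed 3D FFT sketch in mpi4py/NumPy, assuming the process count P divides the cube edge n; the hybrid MPI/OpenMP aspect and the plane-wave-specific optimizations of the paper are omitted:

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        P, rank = comm.Get_size(), comm.Get_rank()
        n = 64                                   # global cube edge; assumes P divides n
        nl = n // P

        # local x-slab of a global n^3 complex array
        a = np.random.default_rng(rank).standard_normal((nl, n, n)) + 0j
        a = np.fft.fftn(a, axes=(1, 2))          # FFT the two fully local axes (y, z)

        # repartition: split y into P blocks and send block j to rank j
        send = np.ascontiguousarray(a.reshape(nl, P, n // P, n).transpose(1, 0, 2, 3))
        recv = np.empty_like(send)
        comm.Alltoall(send, recv)
        b = recv.reshape(n, n // P, n)           # x is now fully local, y distributed
        b = np.fft.fft(b, axis=0)                # finish with the FFT along x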

  5. Power processor for a 20CM ion thruster

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Schoenfeld, A. D.; Cohen, E.

    1973-01-01

    A power processor breadboard for the JPL 20CM ion engine was designed, fabricated, and tested to determine compliance with the electrical specification. The power processor breadboard used the silicon-controlled rectifier (SCR) series resonant inverter as the basic power stage to process all the power to the ion engine. The breadboard power processor was integrated with the JPL 20CM ion engine and complete testing was performed. The integration tests were completed without any silicon-controlled rectifier failure, demonstrating the ruggedness of the series resonant inverter in protecting the switching elements during arcing in the ion engine. A method of fault-clearing the ion engine and returning to normal operation without elaborate sequencing and timing control logic was evolved. In this method, the main vaporizer was turned off and the discharge current limit was reduced whenever an overload existed on the screen/accelerator supply. After the high voltage returned to normal, both the main vaporizer and the discharge were returned to normal operation.

  6. Is random access memory random?

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1986-01-01

    Most software is constructed on the assumption that the programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match the processor addressing rate with the memory service rate. These include interleaved and cached memory. A very high fraction of a processor's address requests can be satisfied from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
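
    A quick NumPy demonstration of the locality point: the same reduction over a row-major matrix runs noticeably faster when the inner traversal follows memory order, because the cache serves most accesses (timings vary by machine):

        import time
        import numpy as np

        a = np.random.default_rng(0).standard_normal((4096, 4096))  # C (row-major) order

        t = time.perf_counter()
        s_rows = sum(a[i, :].sum() for i in range(a.shape[0]))      # unit-stride rows
        row_time = time.perf_counter() - t

        t = time.perf_counter()
        s_cols = sum(a[:, j].sum() for j in range(a.shape[1]))      # strided columns
        col_time = time.perf_counter() - t

        print(f"rows {row_time:.3f}s, cols {col_time:.3f}s")        # columns typically slower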

  7. Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

    PubMed Central

    Kim, Deokho; Park, Karam; Ro, Won W.

    2011-01-01

    While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores are likely to be used as processing nodes of wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors, the Cell Broadband Engine. PMID:22164053

  8. 3D environment modeling and location tracking using off-the-shelf components

    NASA Astrophysics Data System (ADS)

    Luke, Robert H.

    2016-05-01

    The remarkable popularity of smartphones over the past decade has led to a technological race for dominance in market share. This has resulted in a flood of new processors and sensors that are inexpensive, low-power, and high-performance. These sensors include accelerometers, gyroscopes, barometers, and, most importantly, cameras. This sensor suite, coupled with multicore processors, allows a new community of researchers to build small, high-performance platforms at low cost. This paper describes a system using off-the-shelf components to perform position tracking as well as environment modeling. The system relies on stereo vision and inertial navigation to track the movement of the system and to create a model of the sensed environment.
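
    The standard stereo relation underlying such systems: with focal length f in pixels and baseline B in metres, a matched feature's disparity d in pixels gives its depth as Z = f*B/d. The numbers below are illustrative, not from the paper:

        import numpy as np

        f_px, baseline_m = 700.0, 0.12              # hypothetical camera parameters
        disparity_px = np.array([35.0, 14.0, 7.0])  # matched-feature disparities
        depth_m = f_px * baseline_m / disparity_px  # -> [2.4, 6.0, 12.0] metres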

  9. Reward-based learning under hardware constraints-using a RISC processor embedded in a neuromorphic substrate.

    PubMed

    Friedmann, Simon; Frémaux, Nicolas; Schemmel, Johannes; Gerstner, Wulfram; Meier, Karlheinz

    2013-01-01

    In this study, we propose and analyze in simulations a new, highly flexible method of implementing synaptic plasticity in a wafer-scale, accelerated neuromorphic hardware system. The study focuses on globally modulated STDP as a special use-case of this method. Flexibility is achieved by embedding a general-purpose processor dedicated to plasticity into the wafer. To evaluate the suitability of the proposed system, we use a reward-modulated STDP rule in a spike train learning task. A single layer of neurons is trained to fire at specific points in time with only the reward as feedback. This model is simulated to measure its performance, i.e., the increase in received reward after learning. Using this performance as baseline, we then simulate the model with various constraints imposed by the proposed implementation and compare the performance. The simulated constraints include discretized synaptic weights, a restricted interface between analog synapses and embedded processor, and mismatch of analog circuits. We find that probabilistic updates can increase the performance of low-resolution weights, a simple interface between analog synapses and processor is sufficient for learning, and performance is insensitive to mismatch. Further, we consider communication latency between the wafer and the conventional control computer that simulates the environment. This latency increases the delay with which the reward is sent to the embedded processor. Because of the time-continuous operation of the analog synapses, the delay can cause the updates to deviate from the non-delayed case. We find that for highly accelerated systems latency has to be kept to a minimum. This study demonstrates the suitability of the proposed implementation to emulate the selected reward-modulated STDP learning rule. It is therefore an ideal candidate for implementation in an upgraded version of the wafer-scale system developed within the BrainScaleS project.
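
    The finding that probabilistic updates help low-resolution weights can be sketched as stochastic rounding (our schematic with assumed constants, not the BrainScaleS implementation): an ideal continuous weight change smaller than one discrete step is applied with probability proportional to its size.

```python
# Schematic sketch: reward-modulated update on a 4-bit discretized weight.
import random

LEVELS = 16                     # assumed 4-bit synaptic weight, values 0..15
STEP = 1.0 / (LEVELS - 1)       # weight resolution on [0, 1]

def rstdp_update(w_level, eligibility, reward, lr=0.05):
    """Return the new discrete weight level after a reward-modulated update."""
    dw = lr * reward * eligibility          # ideal continuous update
    n_steps, frac = divmod(abs(dw) / STEP, 1.0)
    if random.random() < frac:              # probabilistic extra step
        n_steps += 1
    sign = 1 if dw >= 0 else -1
    return max(0, min(LEVELS - 1, w_level + sign * int(n_steps)))

w = 8
for _ in range(100):    # repeated small rewarded updates still drift the weight
    w = rstdp_update(w, eligibility=0.4, reward=1.0)
print(w)
```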

  11. Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

    NASA Astrophysics Data System (ADS)

    Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

    2017-12-01

    As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.

  12. Geospace simulations on the Cell BE processor

    NASA Astrophysics Data System (ADS)

    Germaschewski, K.; Raeder, J.; Larson, D.

    2008-12-01

    OpenGGCM (Open Geospace General Circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting of the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFLOPS per processor. Realizing this high performance on the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallel approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to just using the main processor, while still keeping the numerical implementation details of the code maintainable.

  13. Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi

    DOE PAGES

    Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...

    2015-05-22

    Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).

  14. Orthorectification by Using Gpgpu Method

    NASA Astrophysics Data System (ADS)

    Sahin, H.; Kulur, S.

    2012-07-01

    Thanks to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power exceeding a teraflop per second. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors, with far higher computing throughput and memory bandwidth than central processing units (CPUs). Data-parallel computation can be described briefly as mapping data elements to parallel processing threads. The rapid development of GPU programmability and capability has attracted researchers dealing with complex problems that demand heavy computation, giving rise to the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". Graphics processors are powerful yet inexpensive and affordable hardware, and so have become an alternative to conventional processors: graphics chips that were once fixed-function hardware have been transformed into modern, powerful and programmable processors. The main difficulty is that graphics processing units use programming models unlike current programming methods, so efficient GPU programming requires re-coding the existing algorithm with the limitations and structure of the graphics hardware in mind; multi-core processors cannot be programmed with traditional sequential or event-driven methods. GPUs are especially effective when the same computing steps are repeated over many data elements and high accuracy is needed, making the computation both faster and more accurate; by comparison, CPUs, which perform one computation at a time under flow control, are slower for such workloads. This study covers how general-purpose parallel programming and the computational power of GPUs can be used in photogrammetric applications, especially direct georeferencing. The direct georeferencing algorithm was coded using the GPGPU method and the CUDA (Compute Unified Device Architecture) programming language, and the results were compared with a traditional CPU implementation. In a second application, projective rectification was coded using the GPGPU method and CUDA; sample images of various sizes were processed and the results evaluated. The GPGPU method is particularly suited to repeating the same computations on highly dense data, and thus finds the solution quickly.
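
    As an illustration of the data-parallel mapping described above (our sketch; the paper's CUDA implementation is not reproduced here), projective rectification applies one 3x3 homography H to every output pixel independently, so each pixel can become one GPU thread; numpy broadcasting stands in for the CUDA grid.

```python
# Sketch: projective rectification as a per-pixel data-parallel map.
import numpy as np

def rectify(image, H):
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]                      # one "thread" per pixel
    ones = np.ones_like(xs)
    dst = np.stack([xs, ys, ones]).reshape(3, -1)    # homogeneous coordinates
    src = np.linalg.inv(H) @ dst                     # inverse map: output -> input
    sx = np.round(src[0] / src[2]).astype(int)
    sy = np.round(src[1] / src[2]).astype(int)
    valid = (0 <= sx) & (sx < w) & (0 <= sy) & (sy < h)
    out = np.zeros_like(image)
    out.ravel()[valid] = image[sy[valid], sx[valid]] # nearest-neighbour resample
    return out

H = np.array([[1.0, 0.1, 5.0],      # illustrative homography, not from the paper
              [0.0, 1.0, 3.0],
              [0.001, 0.0, 1.0]])
print(rectify(np.arange(100.0).reshape(10, 10), H).round(1))
```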

  15. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1997-01-01

    A sequence information signal processing integrated circuit chip designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
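
    The recurrence the chip evaluates is the Smith-Waterman local-alignment score; a compact software reference (ours, with simplified linear gap penalties rather than the patent's per-sequence weight functions) is below. A systolic array fills the same table in parallel along anti-diagonals, one processing element per column.

```python
# Reference sketch of the Smith-Waterman local-alignment recurrence.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                       # local alignment: floor at 0
                          H[i - 1][j - 1] + sub,   # substitution
                          H[i - 1][j] + gap,       # deletion
                          H[i][j - 1] + gap)       # insertion
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCA"))  # maximum local similarity score
```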

  16. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
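
    A software analogue (ours, hedged) of what the analog cell array computes: each input vector is compared against all N codebook vectors at once, and the index of the closest one is emitted, i.e., standard vector-quantization encoding.

```python
# Sketch: nearest-codebook search, the core of VQ-based data compression.
import numpy as np

def vq_encode(input_vectors, codebook):
    """input_vectors: (num, M); codebook: (N, M). Returns (num,) indices."""
    # Broadcast all input/codebook pairs at once, as the N x M cell array does.
    diff = input_vectors[:, None, :] - codebook[None, :, :]
    distances = np.sum(diff * diff, axis=2)
    return np.argmin(distances, axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))            # N=8 codebook vectors, M=4
inputs = codebook[[3, 1, 6]] + 0.01 * rng.normal(size=(3, 4))
print(vq_encode(inputs, codebook))            # expected: [3 1 6]
```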

  17. Extended performance electric propulsion power processor design study. Volume 1: Executive summary

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.; Schoenfeld, A. D.

    1977-01-01

    Several power processor design concepts were evaluated and compared. Emphasis was placed on a 30cm ion thruster power processor with a beam supply rating of 2.2kW to 10kW. Extensions in power processor performance were defined and were designed in sufficient detail to determine efficiency, component weight, part count, reliability and thermal control. Preliminary electrical design, mechanical design, and thermal analysis were performed on a 6kW power transformer for the beam supply. Bi-Mod mechanical, structural, and thermal control configurations were evaluated for the power processor, and preliminary estimates of mechanical weight were determined. A program development plan was formulated that outlines the work breakdown structure for the development, qualification and fabrication of the power processor flight hardware.

  18. Feasibility of optically interconnected parallel processors using wavelength division multiplexing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deri, R.J.; De Groot, A.J.; Haigh, R.E.

    1996-03-01

    New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and for analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to ≈150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.

  19. Hardware multiplier processor

    DOEpatents

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

  1. Implementing Access to Data Distributed on Many Processors

    NASA Technical Reports Server (NTRS)

    James, Mark

    2006-01-01

    A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization.
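
    A minimal sketch (ours, in Python rather than Chapel) of the heart of such a framework: a block distribution that maps a global index to an owning processor and a local offset, so fragmented array pieces can be presented behind a single point of access.

```python
# Sketch: block distribution of a regular index set across processors.
class BlockDistribution:
    def __init__(self, n, num_procs):
        self.n = n
        self.num_procs = num_procs
        self.block = -(-n // num_procs)         # ceiling division: block size

    def owner(self, i):
        return i // self.block                  # processor holding element i

    def to_local(self, i):
        return self.owner(i), i % self.block    # (processor, local offset)

dist = BlockDistribution(n=1000, num_procs=8)
print(dist.to_local(0), dist.to_local(999))     # (0, 0) and (7, 124)
```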

  2. Scalable Multiprocessor for High-Speed Computing in Space

    NASA Technical Reports Server (NTRS)

    Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard

    2004-01-01

    A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.

  3. Simulation of a master-slave event set processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Comfort, J.C.

    1984-03-01

    Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with trace output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.
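
    The operations being offloaded are the classic event-set primitives; a sequential sketch (ours) using a binary heap keyed on event time shows the insert and extract-minimum workload that the master/slave processors parallelize.

```python
# Sketch: a discrete-event simulation event set as a time-ordered heap.
import heapq
import itertools

class EventSet:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-break keeps insertion order

    def schedule(self, time, event):
        heapq.heappush(self._heap, (time, next(self._counter), event))

    def next_event(self):
        time, _, event = heapq.heappop(self._heap)
        return time, event

events = EventSet()
events.schedule(5.0, "arrival")
events.schedule(2.5, "departure")
print(events.next_event())        # (2.5, 'departure'): earliest event first
```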

  4. ORNL Cray X1 evaluation status report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Agarwal, P.K.; Alexander, R.A.; Apra, E.

    2004-05-01

    On August 15, 2002 the Department of Energy (DOE) selected the Center for Computational Sciences (CCS) at Oak Ridge National Laboratory (ORNL) to deploy a new scalable vector supercomputer architecture for solving important scientific problems in climate, fusion, biology, nanoscale materials and astrophysics. "This program is one of the first steps in an initiative designed to provide U.S. scientists with the computational power that is essential to 21st century scientific leadership," said Dr. Raymond L. Orbach, director of the department's Office of Science. In FY03, CCS procured a 256-processor Cray X1 to evaluate the processors, memory subsystem, scalability of the architecture, and software environment, and to predict the expected sustained performance on key DOE applications codes. The results of the micro-benchmarks and kernel benchmarks show the architecture of the Cray X1 to be exceptionally fast for most operations. The best results are shown on large problems, where it is not possible to fit the entire problem into the cache of the processors. These large problems are exactly the types of problems that are important for the DOE and ultra-scale simulation. Application performance is found to be markedly improved by this architecture: - Large-scale simulations of high-temperature superconductors run 25 times faster than on an IBM Power4 cluster using the same number of processors. - Best performance of the parallel ocean program (POP v1.4.3) is 50 percent higher than on Japan's Earth Simulator and 5 times higher than on an IBM Power4 cluster. - A fusion application, global GYRO transport, was found to be 16 times faster on the X1 than on an IBM Power3. The increased performance allowed simulations to fully resolve questions raised by a prior study. - The transport kernel in the AGILE-BOLTZTRAN astrophysics code runs 15 times faster than on an IBM Power4 cluster using the same number of processors. - Molecular dynamics simulations related to the phenomenon of photon echo run 8 times faster than previously achieved. Even at 256 processors, the Cray X1 system is already outperforming other supercomputers with thousands of processors for a certain class of applications such as climate modeling and some fusion applications. This evaluation is the outcome of a number of meetings with both high-performance computing (HPC) system vendors and application experts over the past 9 months and has received broad-based support from the scientific community and other agencies.

  5. Rate and timing cues associated with the cochlear amplifier: level discrimination based on monaural cross-frequency coincidence detection.

    PubMed

    Heinz, M G; Colburn, H S; Carney, L H

    2001-10-01

    The perceptual significance of the cochlear amplifier was evaluated by predicting level-discrimination performance based on stochastic auditory-nerve (AN) activity. Performance was calculated for three models of processing: the optimal all-information processor (based on discharge times), the optimal rate-place processor (based on discharge counts), and a monaural coincidence-based processor that uses a non-optimal combination of rate and temporal information. An analytical AN model included compressive magnitude and level-dependent-phase responses associated with the cochlear amplifier, and high-, medium-, and low-spontaneous-rate (SR) fibers with characteristic frequencies (CFs) spanning the AN population. The relative contributions of nonlinear magnitude and nonlinear phase responses to level encoding were compared by using four versions of the model, which included and excluded the nonlinear gain and phase responses in all possible combinations. Nonlinear basilar-membrane (BM) phase responses are robustly encoded in near-CF AN fibers at low frequencies. Strongly compressive BM responses at high frequencies near CF interact with the high thresholds of low-SR AN fibers to produce large dynamic ranges. Coincidence performance based on a narrow range of AN CFs was robust across a wide dynamic range at both low and high frequencies, and matched human performance levels. Coincidence performance based on all CFs demonstrated the "near-miss" to Weber's law at low frequencies and the high-frequency "mid-level bump." Monaural coincidence detection is a physiologically realistic mechanism that is extremely general in that it can utilize AN information (average-rate, synchrony, and nonlinear-phase cues) from all SR groups.

  6. A Parallel Rendering Algorithm for MIMD Architectures

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.; Orloff, Tobias

    1991-01-01

    Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

  7. High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

    DOEpatents

    Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.

    2002-01-01

    As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

  8. High performance, low cost, self-contained, multipurpose PC based ground systems

    NASA Technical Reports Server (NTRS)

    Forman, Michael; Nickum, William; Troendly, Gregory

    1993-01-01

    The use of embedded processors greatly enhances the capabilities of personal computers when used for telemetry processing and command control center functions. Parallel architectures based on the use of transputers are shown to be very versatile and reusable, and the synergism between the PC and the embedded processor with transputers results in single-unit, low-cost workstations of 20 < MIPS ≤ 1000.

  9. A floating-point/multiple-precision processor for airborne applications

    NASA Technical Reports Server (NTRS)

    Yee, R.

    1982-01-01

    A compact input output (I/O) numerical processor capable of performing floating-point, multiple precision and other arithmetic functions at execution times which are at least 100 times faster than comparable software emulation is described. The I/O device is a microcomputer system containing a 16 bit microprocessor, a numerical coprocessor with eight 80 bit registers running at a 5 MHz clock rate, 18K random access memory (RAM) and 16K electrically programmable read only memory (EPROM). The processor acts as an intelligent slave to the host computer and can be programmed in high order languages such as FORTRAN and PL/M-86.

  10. Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

    NASA Technical Reports Server (NTRS)

    Mavriplis, Dimitri J.

    1998-01-01

    A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.

  11. Multi-threaded parallel simulation of non-local non-linear problems in ultrashort laser pulse propagation in the presence of plasma

    NASA Astrophysics Data System (ADS)

    Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger

    2011-06-01

    We describe a parallel multi-threaded approach for high-performance modelling of a wide class of phenomena in ultrafast nonlinear optics. A specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.

  12. Case for a field-programmable gate array multicore hybrid machine for an image-processing application

    NASA Astrophysics Data System (ADS)

    Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

    2011-01-01

    General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
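
    The parallelized kernel is iris-code matching; a hedged sketch (ours, with an assumed code length and mask handling) of the fractional Hamming distance between binary iris codes shows the bitwise inner loop that maps naturally onto extra cores or FPGA logic.

```python
# Sketch: fractional Hamming distance between binary iris codes.
import numpy as np

def hamming_distance(code_a, code_b, mask):
    """Binary iris codes and a validity mask as uint8 arrays of 0/1."""
    usable = mask.astype(bool)
    disagreements = np.logical_xor(code_a, code_b) & usable
    return disagreements.sum() / usable.sum()

rng = np.random.default_rng(1)
a = rng.integers(0, 2, 2048, dtype=np.uint8)
b = a.copy()
b[:100] ^= 1                                   # same iris, slightly corrupted
mask = np.ones(2048, dtype=np.uint8)
print(hamming_distance(a, b, mask))            # ~0.05: likely a match
```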

  13. QERx- A Faster than Real-Time Emulator for Space Processors

    NASA Astrophysics Data System (ADS)

    Carvalho, B.; Pidgeon, A.; Robinson, P.

    2012-08-01

    Developing software for space systems is challenging, especially because, in order to be sure it can cope with the harshness of the environment and the demanding requirements and constraints imposed by the platform where it will run, it needs to be tested exhaustively. Software Validation Facilities (SVF) are known to the industry and developers, and provide the means to run the On-Board Software (OBSW) in a realistic environment, allowing the development team to debug and test the software. But the challenge is to be able to keep up with the performance of the new processors (LEON2 and LEON3), which need to be emulated within the SVF. Such processor emulators are also used in operational simulators, which support mission preparation and train mission operators. These simulators mimic the satellite and its behaviour as realistically as possible. For test and operational efficiency reasons, and because they will need to interact with external systems, both these use cases require the processor emulators to provide real-time, or faster, performance. It is known to the industry that the performance of previously available emulators is not enough to cope with the performance of the new processors available in the market. SciSys approached this problem with dynamic translation technology, trying to keep costs down by avoiding a hardware solution and keeping the integration flexibility of full software emulation. SciSys presented "QERx: A High Performance Emulator for Software Validation and Simulations" [1] at a previous DASIA event. Since then that idea has evolved and QERx has been successfully validated. SciSys is now presenting QERx as a product that can be tailored to fit different emulation needs. This paper will present QERx's latest developments and current status.

  14. A scalable SIMD digital signal processor for high-quality multifunctional printer systems

    NASA Astrophysics Data System (ADS)

    Kang, Hyeong-Ju; Choi, Yongwoo; Kim, Kimo; Park, In-Cheol; Kim, Jung-Wook; Lee, Eul-Hwan; Gahang, Goo-Soo

    2005-01-01

    This paper describes a high-performance scalable SIMD digital signal processor (DSP) developed for multifunctional printer systems. The DSP supports a variable number of datapaths to cover a wide range of performance and maintain a RISC-like pipeline structure. Many special instructions suitable for image processing algorithms are included in the DSP. Quad/dual instructions are introduced for 8-bit or 16-bit data, and bit-field extraction/insertion instructions are supported to process various data types. Conditional instructions are supported to deal with complex relative conditions efficiently. In addition, an intelligent DMA block is integrated to align data in the course of data reading. Experimental results show that the proposed DSP outperforms a high-end printer-system DSP by at least two times.

  15. GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing

    NASA Astrophysics Data System (ADS)

    Johl, John T.; Baker, Nick C.

    1988-10-01

    The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.

  16. A light hydrocarbon fuel processor producing high-purity hydrogen

    NASA Astrophysics Data System (ADS)

    Löffler, Daniel G.; Taylor, Kyle; Mason, Dylan

    This paper discusses the design process and presents performance data for a dual fuel (natural gas and LPG) fuel processor for PEM fuel cells delivering between 2 and 8 kW electric power in stationary applications. The fuel processor resulted from a series of design compromises made to address different design constraints. First, the product quality was selected; then, the unit operations needed to achieve that product quality were chosen from the pool of available technologies. Next, the specific equipment needed for each unit operation was selected. Finally, the unit operations were thermally integrated to achieve high thermal efficiency. Early in the design process, it was decided that the fuel processor would deliver high-purity hydrogen. Hydrogen can be separated from other gases by pressure-driven processes based on either selective adsorption or permeation. The pressure requirement made steam reforming (SR) the preferred reforming technology because it does not require compression of combustion air; therefore, steam reforming is more efficient in a high-pressure fuel processor than alternative technologies like autothermal reforming (ATR) or partial oxidation (POX), where the combustion occurs at the pressure of the process stream. A low-temperature pre-reformer reactor is needed upstream of a steam reformer to suppress coke formation; yet, low temperatures facilitate the formation of metal sulfides that deactivate the catalyst. For this reason, a desulfurization unit is needed upstream of the pre-reformer. Hydrogen separation was implemented using a palladium alloy membrane. Packed beds were chosen for the pre-reformer and reformer reactors primarily because of their low cost, relatively simple operation and low maintenance. Commercial, off-the-shelf balance of plant (BOP) components (pumps, valves, and heat exchangers) were used to integrate the unit operations. The fuel processor delivers up to 100 slm hydrogen >99.9% pure with <1 ppm CO, <3 ppm CO2. The thermal efficiency is better than 67% operating at full load. This fuel processor has been integrated with a 5-kW fuel cell producing electricity and hot water.

  17. Geospace simulations using modern accelerator processor technology

    NASA Astrophysics Data System (ADS)

    Germaschewski, K.; Raeder, J.; Larson, D. J.

    2009-12-01

    OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational power of modern accelerator-based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLOPS on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.

  18. Life sciences flight experiments microcomputer

    NASA Technical Reports Server (NTRS)

    Bartram, Peter N.

    1987-01-01

    A promising microcomputer configuration for the Spacelab Life Sciences Laboratory Equipment inventory consists of multiple processors: one processor is reserved for general use, with additional processors dedicated to real-time input and output operations. A simple form of such a configuration, with a processor board for analog to digital conversion and another processor board for digital to analog conversion, was studied. The system used digital parallel data lines between the boards, operating independently of the system bus. Good performance of individual components was demonstrated: the analog to digital converter ran at over 10,000 samples per second. The combination of the data transfer between boards with the input or output functions on each board slowed performance, with a maximum throughput of 2800 to 2900 analog samples per second. Any of several techniques, such as use of the system bus for data transfer or the addition of direct memory access hardware to the processor boards, should give significantly improved performance.

  19. Ion Thruster Development at NASA Lewis Research Center

    NASA Technical Reports Server (NTRS)

    Sovey, James S.; Hamley, John A.; Patterson, Michael J.; Rawlin, Vincent K.; Sarver-Verhey, Timothy R.

    1992-01-01

    Recent ion propulsion technology efforts at NASA's Lewis Research Center including development of kW-class xenon ion thrusters, high power xenon and krypton ion thrusters, and power processors are reviewed. Thruster physical characteristics, performance data, life projections, and power processor component technology are summarized. The ion propulsion technology program is structured to address a broad set of mission applications from satellite stationkeeping and repositioning to primary propulsion using solar or nuclear power systems.

  20. FPGA-based distributed computing microarchitecture for complex physical dynamics investigation.

    PubMed

    Borgese, Gianluca; Pace, Calogero; Pantano, Pietro; Bilotta, Eleonora

    2013-09-01

    In this paper, we present a distributed computing system, called DCMARK, aimed at solving partial differential equations at the basis of many investigation fields, such as solid state physics, nuclear physics, and plasma physics. This distributed architecture is based on the cellular neural network paradigm, which allows us to divide the differential equation system solving into many parallel integration operations to be executed by a custom multiprocessor system. We push the number of processors to the limit of one processor for each equation. In order to test the present idea, we choose to implement DCMARK on a single FPGA, designing the single processor in order to minimize its hardware requirements and to obtain a large number of easily interconnected processors. This approach is particularly suited to studying the properties of 1-, 2- and 3-D locally interconnected dynamical systems. In order to test the computing platform, we implement a 200-cell Korteweg-de Vries (KdV) equation solver and perform a comparison between simulations conducted on a high performance PC and on our system. Since our distributed architecture takes a constant computing time to solve the equation system, independently of the number of dynamical elements (cells) of the CNN array, it reduces the computation time more than other similar systems in the literature. To ensure a high level of reconfigurability, we designed a compact system on programmable chip managed by a softcore processor, which controls the fast data/control communication between our system and a PC host. An intuitive graphical user interface allows us to change the calculation parameters and plot the results.
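
    The KdV dynamics being solved, u_t + 6 u u_x + u_xxx = 0, couples each cell only to its nearest neighbours, which is what makes the one-processor-per-cell CNN mapping possible. A serial illustrative integrator (ours, not DCMARK's hardware design) on a periodic ring:

```python
# Sketch: KdV on a periodic ring with central differences and RK4 time-stepping.
import numpy as np

def rhs(u, dx):
    """Right-hand side of u_t = -6 u u_x - u_xxx, periodic neighbours only."""
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    uxxx = (np.roll(u, -2) - 2 * np.roll(u, -1)
            + 2 * np.roll(u, 1) - np.roll(u, 2)) / (2 * dx ** 3)
    return -6 * u * ux - uxxx

def rk4_step(u, dx, dt):
    k1 = rhs(u, dx)
    k2 = rhs(u + 0.5 * dt * k1, dx)
    k3 = rhs(u + 0.5 * dt * k2, dx)
    k4 = rhs(u + dt * k3, dx)
    return u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

x = np.linspace(0, 20, 200, endpoint=False)   # 200 cells on a periodic ring
dx = x[1] - x[0]
u = 0.5 / np.cosh(0.5 * (x - 5)) ** 2         # single-soliton initial profile
for _ in range(4000):                          # integrate to t = 2
    u = rk4_step(u, dx, dt=5e-4)
print(round(float(u.max()), 3))                # soliton amplitude ~0.5 preserved
```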

  1. Linear Spectral Analysis of Plume Emissions Using an Optical Matrix Processor

    NASA Technical Reports Server (NTRS)

    Gary, C. K.

    1992-01-01

    Plume spectrometry provides a means to monitor the health of a burning rocket engine, and optical matrix processors provide a means to analyze the plume spectra in real time. By observing the spectrum of the exhaust plume of a rocket engine, researchers have detected anomalous behavior of the engine and have even determined the failure of some equipment before it would normally have been noticed. The spectrum of the plume is analyzed by isolating information in the spectrum about the various materials present to estimate what materials are being burned in the engine. Scientists at the Marshall Space Flight Center (MSFC) have implemented a high resolution spectrometer to discriminate the spectral peaks of the many species present in the plume. Researchers at the Stennis Space Center Demonstration Testbed Facility (DTF) have implemented a high resolution spectrometer observing a 1200-lb. thrust engine. At this facility, known concentrations of contaminants can be introduced into the burn, allowing for the confirmation of diagnostic algorithms. While the high resolution of the measured spectra has allowed greatly increased insight into the functioning of the engine, the large data flows generated limit the ability to perform real-time processing. The use of an optical matrix processor and the linear analysis technique described below may allow for the detailed real-time analysis of the engine's health. A small optical matrix processor can perform the required mathematical analysis both quicker and with less energy than a large electronic computer dedicated to the same spectral analysis routine.
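
    The linear analysis amounts to one matrix operation: model the observed spectrum as a linear mixture of known species spectra and solve for the amounts by least squares, the matrix-vector workload an optical processor executes at the speed of light. A hedged sketch with synthetic data (all values ours):

```python
# Sketch: least-squares unmixing of a spectrum into known species amounts.
import numpy as np

wavelengths = 64
rng = np.random.default_rng(2)
species_spectra = rng.random((wavelengths, 3))   # columns: 3 known species
true_amounts = np.array([1.0, 0.2, 0.05])        # e.g. a trace contaminant
observed = species_spectra @ true_amounts + 0.01 * rng.normal(size=wavelengths)

estimated, *_ = np.linalg.lstsq(species_spectra, observed, rcond=None)
print(estimated.round(2))                        # recovers the mixing amounts
```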

  2. Neurovision processor for designing intelligent sensors

    NASA Astrophysics Data System (ADS)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  3. Techniques for the rapid display and manipulation of 3-D biomedical data.

    PubMed

    Goldwasser, S M; Reynolds, R A; Talton, D A; Walsh, E S

    1988-01-01

    The use of fully interactive 3-D workstations with true real-time performance will become increasingly common as technology matures and economical commercial systems become available. This paper provides a comprehensive introduction to high speed approaches to the display and manipulation of 3-D medical objects obtained from tomographic data acquisition systems such as CT, MR, and PET. A variety of techniques are outlined including the use of software on conventional minicomputers, hardware assist devices such as array processors and programmable frame buffers, and special purpose computer architecture for dedicated high performance systems. While both algorithms and architectures are addressed, the major theme centers around the utilization of hardware-based approaches including parallel processors for the implementation of true real-time systems.

  4. High-performance multiprocessor architecture for a 3-D lattice gas model

    NASA Technical Reports Server (NTRS)

    Lee, F.; Flynn, M.; Morf, M.

    1991-01-01

    The lattice gas method has recently emerged as a promising discrete particle simulation method in areas such as fluid dynamics. We present a very high-performance scalable multiprocessor architecture, called ALGE, proposed for the simulation of a realistic 3-D lattice gas model, Henon's 24-bit FCHC isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for this application. ALGE is scalable in the sense that it achieves linear speedup for both fixed and increasing problem sizes with more processors. The core computation of a lattice gas model consists of many repetitions of two alternating phases: particle collision and propagation. Functional decomposition by symmetry group and virtual move are the respective keys to efficient implementation of collision and propagation.

  5. Optimally stopped variational quantum algorithms

    NASA Astrophysics Data System (ADS)

    Vinci, Walter; Shabani, Alireza

    2018-04-01

    Quantum processors promise a paradigm shift in high-performance computing which needs to be assessed by accurate benchmarking measures. In this article, we introduce a benchmark for the variational quantum algorithm (VQA), recently proposed as a heuristic algorithm for small-scale quantum processors. In VQA, a classical optimization algorithm guides the processor's quantum dynamics to yield the best solution for a given problem. A complete assessment of the scalability and competitiveness of VQA should take into account both the quality and the time of dynamics optimization. The method of optimal stopping, employed here, provides such an assessment by explicitly including time as a cost factor. Here, we showcase this measure for benchmarking VQA as a solver for some quadratic unconstrained binary optimization problems. Moreover, we show that a better choice for the cost function of the classical routine can significantly improve the performance of the VQA algorithm and even improve its scaling properties.
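
    A toy sketch of the optimal-stopping idea (entirely ours; the paper's stopping rule and cost model are more principled): treat each additional optimization trial as having a fixed time cost, and stop when the recent rate of improvement no longer pays for it.

```python
# Toy sketch: stop a restart-based optimizer when time stops paying off.
import random

def run_with_stopping(trial_cost=0.01, max_trials=1000):
    best, history = float("inf"), []
    for t in range(1, max_trials + 1):
        energy = random.gauss(0.0, 1.0)        # stand-in for one solver run
        best = min(best, energy)
        history.append(best)
        # Estimate recent improvement per trial; stop when below the time cost.
        if t >= 20:
            recent_gain = history[-20] - history[-1]
            if recent_gain / 20 < trial_cost:
                break
    return best, t

best, trials = run_with_stopping()
print(f"best energy {best:.3f} after {trials} trials")
```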

  6. Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2

    NASA Technical Reports Server (NTRS)

    Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

    1996-01-01

    Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.0X speedup on 64 processors when 10 percent of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all the mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh immediately after targeting edges for refinement but before the actual adaption takes place. With this change, the speedup improves dramatically to 43.6X.

  8. RASSP signal processing architectures

    NASA Astrophysics Data System (ADS)

    Shirley, Fred; Bassett, Bob; Letellier, J. P.

    1995-06-01

    The Rapid Prototyping of Application-Specific Signal Processors (RASSP) program is an ARPA/tri-service effort to dramatically improve the process by which complex digital systems, particularly embedded signal processors, are specified, designed, documented, manufactured, and supported. The domain of embedded signal processing was chosen because it is important to a variety of military and commercial applications as well as for the challenge it presents in terms of complexity and performance demands. The principal effort is being performed by two major contractors, Lockheed Sanders (Nashua, NH) and Martin Marietta (Camden, NJ). For both, improvements in methodology are to be exercised and refined through the performance of individual 'Demonstration' efforts. The Lockheed Sanders Demonstration effort is to develop an infrared search and track (IRST) processor. In addition, both contractors' results are being measured by a series of externally administered (by Lincoln Labs) six-month Benchmark programs that measure process improvement as a function of time. The first two Benchmark programs are designing and implementing a synthetic aperture radar (SAR) processor. Our Demonstration team is using commercially available VME modules from Mercury Computer to assemble a multiprocessor system scalable from one to hundreds of Intel i860 microprocessors. Custom modules for the sensor interface and display driver are also being developed. This system implements either proprietary or Navy-owned algorithms to perform the compute-intensive IRST function in real time in an avionics environment. Our Benchmark team is designing custom modules using commercially available processor chip sets, communication submodules, and reconfigurable logic devices. One of the modules contains multiple vector processors optimized for fast Fourier transform processing. Another module is a fiberoptic interface that accepts high-rate input data from the sensors and provides video-rate output data to a display. This paper discusses the impact of simulation on choosing signal processing algorithms and architectures, drawing from the experiences of the Demonstration and Benchmark inter-company teams at Lockheed Sanders, Motorola, Hughes, and ISX.

  9. Prototype Focal-Plane-Array Optoelectronic Image Processor

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey

    1995-01-01

    Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.
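
    To make the per-pixel morphological operation concrete, the C sketch below shows a 3x3 binary erosion written as a serial loop; on the focal-plane array, each processing element would evaluate the same neighbourhood rule for its own pixel in parallel. The function and image sizes are illustrative assumptions, not the NPO-18174 design.

        /* out(x,y) = 1 iff every pixel in the 3x3 neighbourhood of
           (x,y) is 1; borders are left at 0 for simplicity. Dilation,
           the dual operation, would use "any" instead of "every". */
        #include <string.h>

        #define W 32
        #define H 32

        static void erode3x3(const unsigned char in[H][W],
                             unsigned char out[H][W])
        {
            memset(out, 0, H * W);
            for (int y = 1; y < H - 1; y++)
                for (int x = 1; x < W - 1; x++) {
                    int all = 1;
                    for (int dy = -1; dy <= 1 && all; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            all &= in[y + dy][x + dx];
                    out[y][x] = (unsigned char)all;
                }
        }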

  10. Multiphase complete exchange on Paragon, SP2 and CS-2

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.

    1995-01-01

    The overhead of interprocessor communication is a major factor in limiting the performance of parallel computer systems. The complete exchange is the severest communication pattern in that it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. On hypercubes, multiphase complete exchange has been developed and shown to provide optimal performance over varying message sizes. Most commercial multicomputer systems do not have a hypercube interconnect. However, they use special purpose hardware and dedicated communication processors to achieve very high performance communication and can be made to emulate the hypercube quite well. Multiphase complete exchange has been implemented on three contemporary parallel architectures: the Intel Paragon, IBM SP2 and Meiko CS-2. The essential features of these machines are described and their basic interprocessor communication overheads are discussed. The performance of multiphase complete exchange is evaluated on each machine. It is shown that the theoretical ideas developed for hypercubes are also applicable in practice to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
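
    For concreteness, the C/MPI sketch below implements the direct (single-phase) complete exchange with pairwise XOR scheduling, the contention-free hypercube pattern that the multiphase algorithm generalizes by grouping phases to trade message start-ups against transmitted volume. The buffer sizes and the power-of-two process count are illustrative assumptions, not the paper's experimental setup.

        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            int rank, np;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &np);   /* assumed power of two */

            const int msg = 1024;                 /* doubles per partner  */
            double *sendbuf = malloc(np * msg * sizeof *sendbuf);
            double *recvbuf = malloc(np * msg * sizeof *recvbuf);
            for (int i = 0; i < np * msg; i++)
                sendbuf[i] = rank;   /* real code holds distinct data per
                                        destination block                 */

            for (int k = 1; k < np; k++) {        /* np-1 exchange phases */
                int partner = rank ^ k;           /* pairwise, contention-
                                                     free on a hypercube  */
                MPI_Sendrecv(sendbuf + partner * msg, msg, MPI_DOUBLE,
                             partner, 0,
                             recvbuf + partner * msg, msg, MPI_DOUBLE,
                             partner, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            for (int i = 0; i < msg; i++)         /* local block to self  */
                recvbuf[rank * msg + i] = sendbuf[rank * msg + i];

            free(sendbuf); free(recvbuf);
            MPI_Finalize();
            return 0;
        }

    On machines without a hypercube interconnect, the same schedule is simply mapped onto the vendor's network, which is the situation the paper evaluates on the Paragon, SP2 and CS-2.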

  11. Spacewire on Earth orbiting scatterometers

    NASA Technical Reports Server (NTRS)

    Bachmann, Alex; Lang, Minh; Lux, James; Steffke, Richard

    2002-01-01

    The need for a high-speed, reliable and easy-to-implement communication link has led to the development of a space-flight-oriented version of IEEE 1355 called SpaceWire. SpaceWire is based on high-speed (200 Mbps) serial point-to-point links using Low Voltage Differential Signaling (LVDS). SpaceWire has provisions for routing messages across a large network of processors, using wormhole routing for low overhead and latency. Additionally, space-qualified hybrids are available which provide the link layer to the user's bus. A test bed of multiple digital signal processor breadboards, demonstrating the ability to meet signal processing requirements for an orbiting scatterometer, has been implemented using three Astrium MCM-DSPs; each breadboard consists of a Multi Chip Module (MCM) that combines a space-qualified digital signal processor and peripherals, including IEEE 1355 links. With the addition of appropriate physical-layer interfaces and software on the DSP, the SpaceWire link is used to communicate between processors on the test bed, e.g. sending timing references, commands, status, and science data among the processors. Results are presented on development issues surrounding the use of SpaceWire in this environment, from physical-layer implementation (cables, connectors, LVDS drivers) to diagnostic tools, driver firmware, and development methodology. The tools, methods, hardware and software challenges, and preliminary performance are investigated and discussed.

  12. Architectural Techniques For Managing Non-volatile Caches

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mittal, Sparsh

    As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and, hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM pose serious obstacles to its use for designing large on-chip caches, and for this reason researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making them a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.

  13. Performances of multiprocessor multidisk architectures for continuous media storage

    NASA Astrophysics Data System (ADS)

    Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

    1996-03-01

    Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes, through bottleneck performance evaluation and simulation, the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400 MB/s) and that an architecture with addressable local memories located close to their respective processors could partially remove this bottleneck. The point-to-point architecture is scalable and able to sustain high throughputs for simultaneous compute-bound and data-bound operations.

  14. RISC Processors and High Performance Computing

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

    1995-01-01

    In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is used in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and also in the context of implementing real applications. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance per dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.

  15. Mechanical particle coating using polymethacrylate nanoparticle agglomerates for the preparation of controlled release fine particles: The relationship between coating performance and the characteristics of various polymethacrylates.

    PubMed

    Kondo, Keita; Kato, Shinsuke; Niwa, Toshiyuki

    2017-10-30

    We aimed to understand the factors controlling mechanical particle coating using polymethacrylate. The relationship between coating performance and the characteristics of polymethacrylate powders was investigated. First, theophylline crystals were treated using a mechanical powder processor to obtain theophylline spheres (<100 μm). Second, five polymethacrylate latexes were powdered by spray freeze drying to produce colloidal agglomerates. Finally, mechanical particle coating was performed by mixing theophylline spheres and polymethacrylate agglomerates using the processor. The agglomerates were broken under mechanical stress to coat the spheres effectively. The coating performance of polymethacrylate agglomerates tended to increase as their pulverization progressed. Differences in the grindability of the agglomerates were attributed to differences in particle structure, resulting from consolidation between colloidal particles. High-grindability agglomerates exhibited greater pulverization as their glass transition temperature (Tg) increased, and the further pulverization promoted coating. We therefore conclude that the minimization of polymethacrylate powder by pulverization is an important factor in mechanical particle coating using polymethacrylate with low deformability. Meanwhile, when the product temperature during coating approaches the Tg of the polymer, the polymethacrylate softens and shows high coating performance through plastic deformation. Effective coating by this mechanism may be accomplished by adjusting the temperature in the processor to the Tg. Copyright © 2017 Elsevier B.V. All rights reserved.

  16. Extraction of Volatiles from Regolith or Soil on Mars, the Moon, and Asteroids

    NASA Technical Reports Server (NTRS)

    Linne, Diane; Kleinhenz, Julie; Trunek, Andrew; Hoffman, Stephen; Collins, Jacob

    2017-01-01

    NASA's Advanced Exploration Systems ISRU Technology Project is evaluating concepts to extract water from all resource types. Near-term objectives: produce high-fidelity mass, power, and volume estimates for mining and processing systems; identify critical challenges for development focus; begin demonstration of component and subsystem technologies in relevant environments. Several processor types are considered: closed processors, either partially or completely sealed during processing; open-air processors, operating at Mars ambient conditions; and in-situ processors, which extract product directly without excavation of the raw resource. Design features: elimination of sweep gas reduces dust particles in the water condensate; pressure is maintained by the height of soil in the hopper. A model was developed to evaluate key design parameters: geometry (conveyor diameter, screw diameter, shaft diameter, flight spacing and pitch); operation (screw speed vs. screw length, i.e. residence time); and thermal behavior (heat flux, heat transfer to soil). Testing to demonstrate feasibility and performance addresses agglomeration, clogging, pressure rise, and forced flow to the condenser.

  17. Vectorization with SIMD extensions speeds up reconstruction in electron tomography.

    PubMed

    Agulleiro, J I; Garzón, E M; García, I; Fernández, J J

    2010-06-01

    Electron tomography allows structural studies of cellular structures at molecular detail. Large 3D reconstructions are needed to meet the resolution requirements. The processing time to compute these large volumes may be considerable, so high-performance computing techniques have traditionally been used. This work presents a vector approach to tomographic reconstruction that relies on the exploitation of the SIMD extensions available in modern processors, in combination with other single-processor optimization techniques. This approach succeeds in producing full-resolution tomograms with a substantial reduction in processing time, as evaluated with the most common reconstruction algorithms, namely WBP and SIRT. The main advantage stems from the fact that this approach runs on standard computers without the need for specialized hardware, which facilitates the development, use and management of programs. Future trends in processor design open excellent opportunities for vector processing with processors' SIMD extensions in the field of 3D electron microscopy.
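
    The kind of inner loop such vectorization targets can be illustrated with a weighted accumulation of a projection row into a reconstruction slice, shown below in C both in plain form (which a compiler can auto-vectorize) and with x86 SSE intrinsics processing four single-precision values per instruction. This is a sketch of the technique, not the authors' reconstruction code; function names and sizes are assumptions.

        #include <xmmintrin.h>   /* SSE intrinsics */

        /* Plain C form: restrict-qualified pointers let the compiler
           auto-vectorize the accumulation. */
        void accumulate_scalar(float *restrict slice,
                               const float *restrict proj,
                               float w, int n)
        {
            for (int i = 0; i < n; i++)
                slice[i] += w * proj[i];
        }

        /* Explicit SSE form: four float lanes per instruction. */
        void accumulate_sse(float *slice, const float *proj, float w, int n)
        {
            __m128 wv = _mm_set1_ps(w);
            int i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 s = _mm_loadu_ps(slice + i);
                __m128 p = _mm_loadu_ps(proj + i);
                s = _mm_add_ps(s, _mm_mul_ps(wv, p));
                _mm_storeu_ps(slice + i, s);
            }
            for (; i < n; i++)            /* scalar tail */
                slice[i] += w * proj[i];
        }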

  18. System Architecture For High Speed Sorting Of Potatoes

    NASA Astrophysics Data System (ADS)

    Marchant, J. A.; Onyango, C. M.; Street, M. J.

    1989-03-01

    This paper illustrates an industrial application of vision processing in which potatoes are sorted according to their size and shape at speeds of up to 40 objects per second. The result is a multi-processing approach built around the VME bus. A hardware unit has been designed and constructed to encode the boundary of the potatoes, reducing the amount of data to be processed. A master 68000 processor is used to control this unit and to handle data transfers along the bus. Boundary data is passed to one of three 68010 slave processors, each responsible for a line of potatoes across a conveyor belt. The slave processors calculate attributes such as shape, size and estimated weight of each potato, and the master processor uses this data to operate the sorting mechanism. The system has been interfaced with a commercial grading machine and performance trials are now in progress.

  19. Fault-Tolerant, Radiation-Hard DSP

    NASA Technical Reports Server (NTRS)

    Czajkowski, David

    2011-01-01

    Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-high-speed commercial DSPs with improved SEU tolerance. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are addressed by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor back to normal operation through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables the use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or on triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases suffer significant speed reduction. The benefits of COTS high-performance signal processing include a significant increase in onboard science data processing, enabling orders-of-magnitude reduction in required communication bandwidth for science data return, orders-of-magnitude improvement in onboard mission planning and critical decision making, and the ability to rapidly respond to changing mission environments, thus enabling opportunistic science and orders-of-magnitude reduction in the cost of mission operations through reduction of required staff. Additional benefits of COTS-based, high-performance signal processing include the ability to leverage considerable commercial and academic investments in advanced computing tools, techniques, and infrastructure, and the familiarity of the science and IT community with these computing environments.
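
    The voting step at the heart of TTMR is compact enough to show directly. In the C sketch below, the same computation runs three times and a bitwise two-of-three majority masks any single corrupted result; in actual TTMR the three copies would be interleaved across a VLIW processor's issue slots rather than called sequentially as here. The names are illustrative, not from the product described.

        #include <stdint.h>

        /* Per-bit two-of-three majority: any single upset result is
           outvoted by the two agreeing copies. */
        static inline uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
        {
            return (a & b) | (a & c) | (b & c);
        }

        /* Run a computation three times in time and vote on the results. */
        uint32_t ttmr_compute(uint32_t (*f)(uint32_t), uint32_t x)
        {
            uint32_t r1 = f(x), r2 = f(x), r3 = f(x);
            return vote3(r1, r2, r3);
        }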

  20. Emerging Radio and Manet Technology Study: Research Support for a Survey of State-of-the-art Commercial and Military Hardware/Software for Mobile Ad Hoc Networks

    DTIC Science & Technology

    2014-10-01

    44 Table 19: Raspberry Pi Information...boards – These are single board devices targeted to education and embedding, the best known being the Raspberry Pi; and 3. Development boards – These...popular, as it has a high performance processor (perhaps 4 times the power of a Raspberry Pi) with dual core processors running at 1.6 GHz and the cost is

  1. A Unified View of Global Instability of Compressible Flow over Open Cavities

    DTIC Science & Technology

    2006-03-28

    in terms of number of steps realized by the DNS code per second (S/sec) as the number of processors (np) increases. For this comparison the “new...computations). It may clearly be seen that both solutions performed comparably well at low numbers of processors; however, as np increased, the Myrinet...has subsequently been designed, hard-coded and validated at nu modelling. Design characteristics of the code have been a) high-accuracy, b

  2. Off the ear with no loss in speech understanding: comparing the RONDO and the OPUS 2 cochlear implant audio processors.

    PubMed

    Dazert, Stefan; Thomas, Jan Peter; Büchner, Andreas; Müller, Joachim; Hempel, John Martin; Löwenheim, Hubert; Mlynski, Robert

    2017-03-01

    The RONDO is a single-unit cochlear implant audio processor, which omits the need for a behind-the-ear (BTE) audio processor. The primary aim was to compare speech perception results in quiet and in noise with the RONDO and the OPUS 2, a BTE audio processor. Secondary aims were to determine subjects' self-assessed levels of sound quality and gather subjective feedback on RONDO use. All speech perception tests were performed with the RONDO and the OPUS 2 behind-the-ear audio processor at 3 test intervals. Subjects were required to use the RONDO between test intervals. Subjects were tested at upgrade from the OPUS 2 to the RONDO and at 1 and 6 months after upgrade. Speech perception was determined using the Freiburg Monosyllables in quiet test and the Oldenburg Sentence Test (OLSA) in noise. Subjective perception was determined using the Hearing Implant Sound Quality Index (HISQUI19) and a RONDO device-specific questionnaire. 50 subjects participated in the study. Neither speech perception scores nor self-perceived sound quality scores were significantly different at any interval between the RONDO and the OPUS 2. Subjects reported high levels of satisfaction with the RONDO. The RONDO provides comparable speech perception to the OPUS 2 while providing users with high levels of satisfaction and comfort without increasing health risk. The RONDO is a suitable and safe alternative to traditional BTE audio processors.

  3. Toshiba TDF-500 High Resolution Viewing And Analysis System

    NASA Astrophysics Data System (ADS)

    Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.

    1988-06-01

    A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator-interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40 Hz frame and 80 Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra-fast memory system is used. This memory system has a bandwidth capability of 400 MB/s and a total capacity of 256 MB. This bandwidth is more than adequate to support several high resolution CRTs and also the fast processing unit. This fully integrated approach allows effective real-time image processing. The integrated design of viewing system, memory system and array processor is key to the imaging system. It is the intention to describe the architecture of the image system in this paper.

  4. GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

    NASA Astrophysics Data System (ADS)

    Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

    2017-08-01

    The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), a multi-scale chemical transport model used for air quality forecasting and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to a hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512-bit-wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance, and they also work well for the Intel Xeon Broadwell processor, specifically the E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51× faster on KNL and 2.77× faster on the CPU. Moreover, the optimised version ran at 26% lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5% more efficient in power consumption than the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster, scaling to 40 CPU nodes and 30 KNL nodes with parallel efficiencies of 70.4% and 42.2%, respectively.
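
    Optimisation (1), moving from pure MPI to hybrid MPI/OpenMP, typically looks like the C sketch below: MPI ranks keep the domain decomposition while OpenMP threads share the rank-local loop over independent grid columns, as in a gas-phase chemistry operator. The solver body and sizes are placeholders, not GNAQPMS code.

        #include <mpi.h>
        #include <omp.h>
        #include <stdlib.h>

        #define NSPEC 64   /* chemical species per column (placeholder) */

        /* Placeholder for a per-column gas-phase chemistry solve. */
        static void solve_column(double *conc) { (void)conc; }

        int main(int argc, char **argv)
        {
            int provided, rank, np;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &np);

            int ncols = 4096 / np;     /* columns owned by this rank */
            double *conc = malloc((size_t)ncols * NSPEC * sizeof *conc);

            /* Columns are independent, so OpenMP threads split them
               inside the rank; MPI handles the inter-node halo and
               global communication elsewhere. */
            #pragma omp parallel for schedule(dynamic)
            for (int c = 0; c < ncols; c++)
                solve_column(conc + (size_t)c * NSPEC);

            free(conc);
            MPI_Finalize();
            return 0;
        }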

  5. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    NASA Astrophysics Data System (ADS)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

    Program summary:
    Program title: SWsolver
    Catalogue identifier: AEGY_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GPL v3
    No. of lines in distributed program, including test data, etc.: 59 168
    No. of bytes in distributed program, including test data, etc.: 453 409
    Distribution format: tar.gz
    Programming language: C, CUDA
    Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
    Operating system: Linux
    Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
    RAM: Tested on problems requiring up to 4 GB per compute node.
    Classification: 12
    External routines: MPI, CUDA, IBM Cell SDK
    Nature of problem: MPI-parallel simulation of shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
    Solution method: SWsolver provides 3 implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
    Additional comments: Sub-program numdiff is used for the test run.

  6. A miniature on-chip multi-functional ECG signal processor with 30 µW ultra-low power consumption.

    PubMed

    Liu, Xin; Zheng, Yuan Jin; Phyu, Myint Wai; Zhao, Bin; Je, Minkyu; Yuan, Xiao Jun

    2010-01-01

    In this paper, a miniature low-power electrocardiogram (ECG) signal processing application-specific integrated circuit (ASIC) chip is proposed. This chip provides multiple critical functions for ECG analysis using a systematic wavelet transform algorithm and a novel SRAM-based ASIC architecture, while achieving low cost and high performance. Using 0.18 µm CMOS technology and a 1 V power supply, this ASIC chip consumes only 29 µW and occupies an area of 3 mm^2. This on-chip ECG processor is highly suitable for reliable real-time cardiac status monitoring applications.

  7. Analyzing Reliability and Performance Trade-Offs of HLS-Based Designs in SRAM-Based FPGAs Under Soft Errors

    NASA Astrophysics Data System (ADS)

    Tambara, Lucas Antunes; Tonfat, Jorge; Santos, André; Kastensmidt, Fernanda Lima; Medina, Nilberto H.; Added, Nemitala; Aguiar, Vitor A. P.; Aguirre, Fernando; Silveira, Marcilei A. G.

    2017-02-01

    The increasing system complexity of FPGA-based hardware designs and the shortening of time-to-market have motivated the adoption of new design methodologies focused on addressing the current need for high-performance circuits. High-Level Synthesis (HLS) tools can generate Register Transfer Level (RTL) designs from high-level software programming languages. These tools have evolved significantly in recent years, providing optimized RTL designs which can serve the needs of safety-critical applications that require both high performance and high reliability levels. However, a reliability evaluation of HLS-based designs under soft errors has not yet been presented. In this work, the trade-offs of different HLS-based designs in terms of reliability, resource utilization, and performance are investigated by analyzing their behavior under soft errors and comparing them to a standard processor-based implementation in an SRAM-based FPGA. Results obtained from fault injection campaigns and radiation experiments show that it is possible to increase the performance of a processor-based system up to 5,000 times by changing its architecture, with a small impact on the cross section (an increase of up to 8 times), while still increasing the Mean Workload Between Failures (MWBF) of the system.

  8. High-Performance, Radiation-Hardened Electronics for Space Environments

    NASA Technical Reports Server (NTRS)

    Keys, Andrew S.; Watson, Michael D.; Frazier, Donald O.; Adams, James H.; Johnson, Michael A.; Kolawa, Elizabeth A.

    2007-01-01

    The Radiation Hardened Electronics for Space Environments (RHESE) project endeavors to advance the current state of the art in high-performance, radiation-hardened electronics and processors, ensuring successful performance of space systems required to operate within extreme radiation and temperature environments. Because RHESE is a project within the Exploration Technology Development Program (ETDP), RHESE's primary customers will be the human and robotic missions being developed by NASA's Exploration Systems Mission Directorate (ESMD) in partial fulfillment of the Vision for Space Exploration. Benefits are also anticipated for NASA's science missions to planetary and deep-space destinations. As a technology development effort, RHESE provides a broad-scoped, full spectrum of approaches to environmentally harden space electronics, including new materials, advanced design processes, reconfigurable hardware techniques, and software modeling of the radiation environment. The RHESE sub-project tasks are: Self-Reconfigurable Electronics for Extreme Environments; Radiation Effects Predictive Modeling; Radiation Hardened Memory; Single Event Effects (SEE) Immune Reconfigurable Field Programmable Gate Array (FPGA) (SIRF); Radiation Hardening by Software; Radiation Hardened High Performance Processors (HPP); Reconfigurable Computing; Low Temperature Tolerant MEMS by Design; and Silicon-Germanium (SiGe) Integrated Electronics for Extreme Environments. These nine sub-project tasks are managed by technical leads located across five different NASA field centers, including Ames Research Center, Goddard Space Flight Center, the Jet Propulsion Laboratory, Langley Research Center, and Marshall Space Flight Center. The overall RHESE integrated project management responsibility resides with NASA's Marshall Space Flight Center (MSFC). Initial technology development emphasis within RHESE focuses on the hardening of Field Programmable Gate Arrays (FPGAs) and Field Programmable Analog Arrays (FPAAs) for use in reconfigurable architectures. As these component/chip-level technologies mature, the RHESE project emphasis shifts to focus on efforts encompassing total processor hardening techniques and board-level electronic reconfiguration techniques featuring spare and interface modularity. This phased approach to distributing emphasis between technology developments provides hardened FPGAs/FPAAs for early mission infusion, then migrates to hardened, board-level, high-speed processors with associated memory elements and high-density storage for the longer-duration missions encountered for Lunar Outpost and Mars Exploration occurring later in the Constellation schedule.

  9. Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

    NASA Astrophysics Data System (ADS)

    Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

    2008-04-01

    This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
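
    The work farm pattern reduces to a small amount of code. The C sketch below uses pthreads and a shared index to stand in for the MPPA's self-synchronizing hardware channels: a fixed set of workers pulls items from one input stream and writes results to one output stream. This is entirely illustrative of the pattern, not of the MPPA toolchain or its structural object programming model.

        #include <pthread.h>
        #include <stdio.h>

        #define NITEMS   1024
        #define NWORKERS 4

        static int input[NITEMS], output[NITEMS];
        static int next_item = 0;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg)
        {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&lock);     /* claim the next item */
                int i = next_item < NITEMS ? next_item++ : -1;
                pthread_mutex_unlock(&lock);
                if (i < 0) return NULL;        /* input stream drained */
                output[i] = input[i] * input[i];   /* placeholder work */
            }
        }

        int main(void)
        {
            pthread_t tid[NWORKERS];
            for (int i = 0; i < NITEMS; i++) input[i] = i;
            for (int w = 0; w < NWORKERS; w++)
                pthread_create(&tid[w], NULL, worker, NULL);
            for (int w = 0; w < NWORKERS; w++)
                pthread_join(tid[w], NULL);
            printf("output[10] = %d\n", output[10]);
            return 0;
        }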

  10. Performance of the Cell processor for biomolecular simulations

    NASA Astrophysics Data System (ADS)

    De Fabritiis, G.

    2007-06-01

    The new Cell processor represents a turning point for computing intensive applications. Here, I show that for molecular dynamics it is possible to reach an impressive sustained performance in excess of 30 Gflops with a peak of 45 Gflops for the non-bonded force calculations, over one order of magnitude faster than a single core standard processor.
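
    The non-bonded kernel that dominates such runs is a simple pair loop, sketched below in plain C; a Cell port would SIMD-ize the arithmetic on the SPEs and stream particle blocks through local store via DMA. Cutoffs, neighbour lists and units are omitted, so this shows only the structure, not the paper's implementation.

        #include <math.h>

        /* Accumulate Lennard-Jones pair forces:
           U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6). */
        void lj_forces(int n, const double *x, const double *y,
                       const double *z, double *fx, double *fy, double *fz,
                       double eps, double sigma)
        {
            double s6 = pow(sigma, 6);
            for (int i = 0; i < n; i++)
                for (int j = i + 1; j < n; j++) {
                    double dx = x[i] - x[j];
                    double dy = y[i] - y[j];
                    double dz = z[i] - z[j];
                    double r2 = dx*dx + dy*dy + dz*dz;
                    double inv6 = s6 / (r2 * r2 * r2);   /* (sigma/r)^6 */
                    /* |F|/r = 24*eps*inv6*(2*inv6 - 1)/r^2 */
                    double f = 24.0 * eps * inv6 * (2.0 * inv6 - 1.0) / r2;
                    fx[i] += f * dx; fy[i] += f * dy; fz[i] += f * dz;
                    fx[j] -= f * dx; fy[j] -= f * dy; fz[j] -= f * dz;
                }
        }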

  11. Initial Performance Results on IBM POWER6

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piyush

    2008-01-01

    The POWER5+ processor has a faster memory bus than that of the previous-generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the POWER6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB, double that of the POWER5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in the POWER5+. In this paper, we evaluate the performance of a dual-core POWER6 based IBM p6-570 system, and we compare its performance with that of a dual-core POWER5+ based IBM p575+ system. In this evaluation, we have used the High-Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications: three from computational fluid dynamics and one from climate modeling.
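
    Per-core memory bandwidth figures like those quoted above are usually measured with a STREAM-style kernel; the minimal C triad below reports bytes moved per second for one pass over three arrays. The array size and the single-pass timing are illustrative simplifications, not the measurement procedure used in the paper.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (1L << 22)     /* ~32 MB per array: defeats the caches */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < N; i++)
                a[i] = b[i] + 3.0 * c[i];          /* triad kernel */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double sec = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            /* three arrays touched per pass: 3 * N * 8 bytes */
            printf("%.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
            free(a); free(b); free(c);
            return 0;
        }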

  12. GR712RC- Dual-Core Processor- Product Status

    NASA Astrophysics Data System (ADS)

    Sturesson, Fredrik; Habinc, Sandi; Gaisler, Jiri

    2012-08-01

    The GR712RC System-on-Chip (SoC) is a dual-core LEON3FT system suitable for advanced high-reliability space avionics. Fault tolerance features from Aeroflex Gaisler's GRLIB IP library and an implementation using Ramon Chips' RadSafe cell library enable superior radiation hardness. The GR712RC device has been designed to provide high processing power by including two LEON3FT 32-bit SPARC V8 processors, each with its own high-performance IEEE 754 compliant floating-point unit and SPARC reference memory management unit. This high processing power is combined with a large number of serial interfaces, ranging from high-speed links for data transfers to low-speed control buses for commanding and status acquisition.

  13. Status of a Power Processor for the Prometheus-1 Electric Propulsion System

    NASA Technical Reports Server (NTRS)

    Pinero, Luis R.; Hill, Gerald M.; Aulisio, Michael; Gerber, Scott; Griebeler, Elmer; Hewitt, Frank; Scina, Joseph

    2006-01-01

    NASA is developing technologies for nuclear electric propulsion for proposed deep space missions in support of the Exploration initiative under Project Prometheus. Electrical power produced by the combination of a fission-based power source and a Brayton power conversion and distribution system is used by a high-specific-impulse ion propulsion system to propel the spaceship. The ion propulsion system includes the thruster, power processor and propellant feed system. A power processor technology development effort was initiated under Project Prometheus to develop high-performance and lightweight power-processing technologies suitable for the application. This effort faces multiple challenges, including developing radiation-hardened power modules and converters with very high power capability and efficiency to minimize the impact on the power conversion and distribution system as well as the heat rejection system. This paper documents the design and test results of the first version of the beam supply, the design of a second version of the beam supply, and the design and test results of the ancillary supplies.

  14. Hybrid Electro-Optic Processor

    DTIC Science & Technology

    1991-07-01

    This report describes the design of a hybrid electro-optic processor to perform adaptive interference cancellation in radar systems. The processor is...modulator is reported. Included in this report is a discussion of the design, partial fabrication in the laboratory, and partial testing of the hybrid electro...optic processor. A follow-on effort is planned to complete the construction and testing of the processor. The work described in this report is the

  15. Large-N in Volcano Settings: Volcanosri

    NASA Astrophysics Data System (ADS)

    Lees, J. M.; Song, W.; Xing, G.; Vick, S.; Phillips, D.

    2014-12-01

    We seek a paradigm shift in the approach we take to volcano monitoring, in which high fidelity is traded for large numbers of sensors to increase coverage and resolution. Accessibility, danger and the risk of equipment loss require that we develop systems that are independent and inexpensive. Furthermore, rather than simply recording data on hard disk for later analysis, we desire a system that will work autonomously, capitalizing on wireless technology and in-field network analysis. To this end we are currently producing a low cost seismic array which will incorporate, at the very basic level, seismological tools for first-cut analysis of a volcano in crisis mode. At the advanced end we expect to perform tomographic inversions in the network in near real time. Geophone (4 Hz) sensors connected to a low cost recording system will be installed on an active volcano, where triggering, earthquake location and velocity analysis will take place independently of human interaction. Stations are designed to be inexpensive and possibly disposable. In one of the first implementations the seismic nodes consist of an Arduino Due processor board with an attached Seismic Shield. The Arduino Due processor board contains an Atmel SAM3X8E ARM Cortex-M3 CPU. This 32-bit, 84 MHz processor can filter and perform coarse seismic event detection on a 1600-sample signal in fewer than 200 milliseconds. The Seismic Shield contains a GPS module, 900 MHz high power mesh network radio, SD card, seismic amplifier, and 24-bit ADC. External sensors can be attached to either this 24-bit ADC or to the internal multichannel 12-bit ADC contained on the Arduino Due processor board. This allows the node to support attachment of multiple sensors. By utilizing a high-speed 32-bit processor, complex signal processing tasks can be performed simultaneously on multiple sensors. Using a 10 W solar panel, a second system being developed can run autonomously and collect data on 3 channels at 100 Hz for 6 months with the installed 16 GB SD card. Initial designs and test results will be presented and discussed.
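
    Coarse on-node event detection of the kind described is classically done with an STA/LTA trigger, sketched in C below: an event is declared when the short-term average amplitude exceeds the long-term average by a threshold ratio. The window lengths and threshold are illustrative assumptions, not the project's values.

        #include <math.h>

        /* Return the sample index where the STA/LTA ratio first exceeds
           the threshold, or -1 if no event is detected. Assumes
           sta_len < lta_len <= n. */
        int sta_lta_trigger(const float *x, int n,
                            int sta_len, int lta_len, float ratio)
        {
            for (int i = lta_len; i < n; i++) {
                float sta = 0.0f, lta = 0.0f;
                for (int j = i - sta_len; j < i; j++)
                    sta += fabsf(x[j]);        /* short-term energy   */
                for (int j = i - lta_len; j < i; j++)
                    lta += fabsf(x[j]);        /* long-term baseline  */
                sta /= sta_len;
                lta /= lta_len;
                if (lta > 0.0f && sta / lta > ratio)
                    return i;
            }
            return -1;
        }

    On a small processor the averages would be maintained incrementally rather than recomputed per sample, but the recomputed form above keeps the rule easy to read.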

  16. Portable parallel stochastic optimization for the design of aeropropulsion components

    NASA Technical Reports Server (NTRS)

    Sues, Robert H.; Rhodes, G. S.

    1994-01-01

    This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initiate the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as a review of portable, parallel programming environments. The second effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be applied effectively to large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed Civil Transport and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.

  17. Considerations for Future Climate Data Stewardship

    NASA Astrophysics Data System (ADS)

    Halem, M.; Nguyen, P. T.; Chapman, D. R.

    2009-12-01

    In this talk, we will describe the lessons learned from processing and generating a decade of gridded AIRS and MODIS IR sounding data. We describe the challenges faced in accessing and sharing very large data sets, maintaining data provenance under evolving technologies, obtaining access to legacy calibration data, and the permanent preservation of Earth science data records for on-demand services. These lessons suggest that a new approach to data stewardship will be required for the next decade of hyperspectral instruments combined with cloud-resolving models. It will not be sufficient for stewards of future data centers to just provide the public with access to archived data; our experience indicates that data need to reside close to computers with ultra-large disc farms and tens of thousands of processors to deliver complex services on demand over very high-speed networks, much like the offerings of search engines today. Over the first decade of the 21st century, petabyte data records were acquired from the AIRS instrument on Aqua and the MODIS instrument on Aqua and Terra. NOAA data centers also maintain petabytes of operational IR sounder data collected over the past four decades. The UMBC Multicore Computational Center (MC2) developed a Service Oriented Atmospheric Radiance gridding system (SOAR) to allow users to select IR sounding instruments from multiple archives and choose space-time-spectral periods of Level 1B data to download, grid, visualize and analyze on demand. Providing this service requires high-bandwidth access to the online disks at Goddard. After 10 years, cost-effective disk storage technology finally caught up with the MODIS data volume, making it possible for Level 1B MODIS data to be available online. However, 10 GbE fiber-optic networks to access large volumes of data are still not available from GSFC to serve the broader community; data transfer rates are well below 10 MB/s, limiting their usefulness for climate studies. During this decade, processor performance hit a power wall, leading computer vendors to design multicore processor chips. High-performance computer systems obtained petaflop performance by clustering tens of thousands of multicore processor chips. Thus, power consumption and autonomic recovery from processor and disc failures have become major cost and technical considerations for future data archives. To address these new architecture requirements, a transparent parallel programming paradigm, the Hadoop MapReduce cloud computing system, became available as an open-source software system. In addition, the Hadoop File System manages the distribution of data to these processors and backs up the processing in the event of any processor or disc failure. However, to employ this paradigm, the data need to be stored on the computer system. We conclude this talk with a climate data preservation approach that addresses the scaling to exabyte data requirements for the next decade, based on projections of processor, disc data density, and bandwidth doubling rates.

  18. FANTOM: Algorithm-Architecture Codesign for High-Performance Embedded Signal and Image Processing Systems

    DTIC Science & Technology

    2013-05-25

    graphics processors by IBM, AMD, and nVIDIA. They are between general-purpose processors and special-purpose processors. In Phase II. 3.10 Measure of...particular, Dr. Kevin Irick started a company Silicon Scapes and he has been the CEO. 5 Implications for Related/Future Research We speculate that...final project report in Jan. 2011. At the test and validation stage of the project. FANTOM's partner at Raytheon quit from his company and hence from

  19. Optical linear algebra processors - Architectures and algorithms

    NASA Technical Reports Server (NTRS)

    Casasent, David

    1986-01-01

    Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.

  20. Sentinel-2 Level 2A Prototype Processor: Architecture, Algorithms And First Results

    NASA Astrophysics Data System (ADS)

    Muller-Wilm, Uwe; Louis, Jerome; Richter, Rudolf; Gascon, Ferran; Niezette, Marc

    2013-12-01

    Sen2Cor is a prototype processor for Sentinel-2 Level 2A product processing and formatting. The processor is developed for and with ESA and performs the tasks of Atmospheric Correction and Scene Classification of Level 1C input data. Level 2A outputs are: Bottom-Of-Atmosphere (BOA) corrected reflectance images; Aerosol Optical Thickness, Water Vapour and Scene Classification maps; and quality indicators, including cloud and snow probabilities. The Level 2A Product Formatting performed by the processor follows the specification of the Level 1C User Product.

  1. High coherence plane breaking packaging for superconducting qubits.

    PubMed

    Bronn, Nicholas T; Adiga, Vivekananda P; Olivadese, Salvatore B; Wu, Xian; Chow, Jerry M; Pappas, David P

    2018-04-01

    We demonstrate a pogo pin package for a superconducting quantum processor specifically designed with a nontrivial layout topology (e.g., a center qubit that cannot be accessed from the sides of the chip). Two experiments on two nominally identical superconducting quantum processors in pogo packages, which use commercially available parts and require modest machining tolerances, are performed at low temperature (10 mK) in a dilution refrigerator and both found to behave comparably to processors in standard planar packages with wirebonds where control and readout signals come in from the edges. Single- and two-qubit gate errors are also characterized via randomized benchmarking, exhibiting similar error rates as in standard packages, opening the possibility of integrating pogo pin packaging with extensible qubit architectures.

  3. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

  4. Effect of poor control of film processors on mammographic image quality.

    PubMed

    Kimme-Smith, C; Sun, H; Bassett, L W; Gold, R H

    1992-11-01

    With the increasingly stringent standards of image quality in mammography, film processor quality control is especially important. Current methods are not sufficient for ensuring good processing. The authors used a sensitometer and densitometer system to evaluate the performance of 22 processors at 16 mammographic facilities. Standard sensitometric values of two films were established, and processor performance was assessed for variations from these standards. Developer chemistry of each processor was analyzed and correlated with its sensitometric values. Ten processors were retested, and nine were found to be out of calibration. The developer components of hydroquinone, sulfites, bromide, and alkalinity varied the most, and low concentrations of hydroquinone were associated with lower average gradients at two facilities. Use of the sensitometer and densitometer system helps identify out-of-calibration processors, but further study is needed to correlate sensitometric values with developer component values. The authors believe that present quality control would be improved if sensitometric or other tests could be used to identify developer components that are out of calibration.

  5. Demonstration of two-qubit algorithms with a superconducting quantum processor.

    PubMed

    DiCarlo, L; Chow, J M; Gambetta, J M; Bishop, Lev S; Johnson, B R; Schuster, D I; Majer, J; Blais, A; Frunzio, L; Girvin, S M; Schoelkopf, R J

    2009-07-09

    Quantum computers, which harness the superposition and entanglement of physical states, could outperform their classical counterparts in solving problems with technological impact, such as factoring large numbers and searching databases. A quantum processor executes algorithms by applying a programmable sequence of gates to an initialized register of qubits, which coherently evolves into a final state containing the result of the computation. Building a quantum processor is challenging because of the need to meet simultaneously requirements that are in conflict: state preparation, long coherence times, universal gate operations and qubit readout. Processors based on a few qubits have been demonstrated using nuclear magnetic resonance, cold ion trap and optical systems, but a solid-state realization has remained an outstanding challenge. Here we demonstrate a two-qubit superconducting processor and the implementation of the Grover search and Deutsch-Jozsa quantum algorithms. We use a two-qubit interaction, tunable in strength by two orders of magnitude on nanosecond timescales, which is mediated by a cavity bus in a circuit quantum electrodynamics architecture. This interaction allows the generation of highly entangled states with concurrence up to 94 per cent. Although this processor constitutes an important step in quantum computing with integrated circuits, continuing efforts to increase qubit coherence times, gate performance and register size will be required to fulfil the promise of a scalable technology.

  6. Systems and methods for performing wireless financial transactions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McCown, Steven Harvey

    2012-07-03

    A secure computing module (SCM) is configured for connection with a host device. The SCM includes a processor for performing secure processing operations, a host interface for coupling the processor to the host device, and a memory connected to the processor, wherein the processor logically isolates at least some of the memory from access by the host device. The SCM also includes a proximate-field wireless communicator connected to the processor to communicate with another SCM associated with another host device. The SCM generates a secure digital signature for a financial transaction package and communicates the package and the signature to the other SCM using the proximate-field wireless communicator. Financial transactions are performed from person to person using the secure digital signature of each person's SCM and possibly message encryption. The digital signatures and transaction details are communicated to appropriate financial organizations to authenticate the transaction parties and complete the transaction.

  7. Evolutionary Telemetry and Command Processor (TCP) architecture

    NASA Technical Reports Server (NTRS)

    Schneider, John R.

    1992-01-01

    A low-cost, modular, high-performance, and compact Telemetry and Command Processor (TCP) is being built as the foundation of command and data handling subsystems for the next generation of satellites. The TCP product line will support command and telemetry requirements for small to large spacecraft and from low to high rate data transmission. It is compatible with the latest TDRSS, STDN and SGLS transponders and provides CCSDS protocol communications in addition to standard TDM formats. Its high-performance computer provides computing resources for hosted flight software. Layered and modular software provides common services using standardized interfaces to applications, thereby enhancing software re-use, transportability, and interoperability. The TCP architecture is based on existing standards, distributed networking, distributed and open system computing, and packet technology. The first TCP application is planned for the '94 SDIO SPAS 3 mission. The architecture enables rapid tailoring of functions, thereby reducing the costs and schedules of development for individual spacecraft missions.

  8. Design and performance of a high resolution, low latency stripline beam position monitor system

    NASA Astrophysics Data System (ADS)

    Apsimon, R. J.; Bett, D. R.; Blaskovic Kraljevic, N.; Burrows, P. N.; Christian, G. B.; Clarke, C. I.; Constance, B. D.; Dabiri Khah, H.; Davis, M. R.; Perry, C.; Resta López, J.; Swinson, C. J.

    2015-03-01

    A high-resolution, low-latency beam position monitor (BPM) system has been developed for use in particle accelerators and beam lines that operate with trains of particle bunches with bunch separations as low as several tens of nanoseconds, such as future linear electron-positron colliders and free-electron lasers. The system was tested with electron beams in the extraction line of the Accelerator Test Facility at the High Energy Accelerator Research Organization (KEK) in Japan. It consists of three stripline BPMs instrumented with analogue signal-processing electronics and a custom digitizer for logging the data. The design of the analogue processor units is presented in detail, along with measurements of the system performance. The processor latency is 15.6 ± 0.1 ns. A single-pass beam position resolution of 291 ± 10 nm has been achieved, using a beam with a bunch charge of approximately 1 nC.

  9. A Cost Effective System Design Approach for Critical Space Systems

    NASA Technical Reports Server (NTRS)

    Abbott, Larry Wayne; Cox, Gary; Nguyen, Hai

    2000-01-01

    NASA-JSC required an avionics platform capable of serving a wide range of applications in a cost-effective manner. In part, making the avionics platform cost effective means adhering to open standards and supporting the integration of COTS products with custom products. Inherently, operation in space requires low power, mass, and volume while retaining high performance, reconfigurability, scalability, and upgradability. The Universal Mini-Controller project is based on a modified PC/104-Plus architecture while maintaining full compatibility with standard COTS PC/104 products. The architecture consists of a library of building block modules, which can be mixed and matched to meet a specific application. A set of NASA-developed core building blocks, a processor card, an analog input/output card, and a Mil-Std-1553 card, have been constructed to meet critical functions and unique interfaces. The design of the processor card is based on the PowerPC architecture. This architecture provides an excellent balance between power consumption and performance, and has an upgrade path to the forthcoming radiation-hardened PowerPC processor. The processor card, which makes extensive use of surface mount technology, has a 166 MHz PowerPC 603e processor, 32 Mbytes of error-detected-and-corrected RAM, 8 Mbytes of Flash, and 1 Mbyte of EPROM, on a single PC/104-Plus card. Similar densities have been achieved with the quad-channel Mil-Std-1553 card and the analog input/output cards. The power management built into the processor and its peripheral chip allows the power and performance of the system to be adjusted to meet the requirements of the application, adding another dimension to the flexibility of the Universal Mini-Controller. Unique mechanical packaging allows the Universal Mini-Controller to accommodate standard COTS and custom oversized PC/104-Plus cards. This mechanical packaging also provides thermal management via conductive cooling of COTS boards, which are typically designed for convection cooling methods.

  10. CSP: A Multifaceted Hybrid Architecture for Space Computing

    NASA Technical Reports Server (NTRS)

    Rudolph, Dylan; Wilson, Christopher; Stewart, Jacob; Gauvin, Patrick; George, Alan; Lam, Herman; Crum, Gary Alex; Wirthlin, Mike; Wilson, Alex; Stoddard, Aaron

    2014-01-01

    Research on the CHREC Space Processor (CSP) takes a multifaceted hybrid approach to embedded space computing. Working closely with the NASA Goddard SpaceCube team, researchers at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida and Brigham Young University are developing hybrid space computers that feature an innovative combination of three technologies: commercial-off-the-shelf (COTS) devices, radiation-hardened (RadHard) devices, and fault-tolerant computing. Modern COTS processors provide the utmost in performance and energy-efficiency but are susceptible to ionizing radiation in space, whereas RadHard processors are virtually immune to this radiation but are more expensive, larger, less energy-efficient, and generations behind in speed and functionality. By featuring COTS devices to perform the critical data processing, supported by simpler RadHard devices that monitor and manage the COTS devices, and augmented with novel uses of fault-tolerant hardware, software, information, and networking within and between COTS devices, the resulting system can maximize performance and reliability while minimizing energy consumption and cost. NASA Goddard has adopted the CSP concept and technology with plans underway to feature flight-ready CSP boards on two upcoming space missions.

  11. Future applications of associative processor systems to operational KSC systems for optimizing cost and enhancing performance characteristics

    NASA Technical Reports Server (NTRS)

    Perkinson, J. A.

    1974-01-01

    The application of associative memory processor equipment to conventional host-processor-based systems is discussed. Efforts were made to demonstrate how such an application relieves the task burden of conventional systems and enhances system speed and efficiency. Data cover comparative theoretical performance analysis, demonstration of expanded growth capabilities, and demonstrations of actual hardware in a simulated environment.

  12. High performance in silico virtual drug screening on many-core processors.

    PubMed

    McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-05-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.

  13. High performance in silico virtual drug screening on many-core processors

    PubMed Central

    Price, James; Sessions, Richard B; Ibarra, Amaurys A

    2015-01-01

    Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

  14. High performance 3D adaptive filtering for DSP based portable medical imaging systems

    NASA Astrophysics Data System (ADS)

    Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

    2015-03-01

    Portable medical imaging devices have proven valuable for emergency medical services both in the field and hospital environments and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. Despite their constraints on power, size and cost, portable imaging devices must still deliver high quality images. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but it is computationally very demanding and hence often cannot be run with sufficient performance on a portable platform. In recent years, advanced multicore digital signal processors (DSPs) have been developed that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms on a portable platform. In this study, the performance of a 3D adaptive filtering algorithm on a DSP is investigated. The performance is assessed by filtering a volume of 512x256x128 voxels sampled at a rate of 10 MVoxels/sec by a 3D ultrasound probe. Relative performance and power are compared between a reference PC (quad-core CPU) and a TMS320C6678 DSP from Texas Instruments.
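
    The specific 3D adaptive filter used in the study is not given in this record. As an illustration only, the following NumPy/SciPy sketch implements a generic Lee-type adaptive smoother (all names and constants are ours): voxels in flat regions are pulled strongly toward the local mean while edges are preserved, and the several full-volume passes it needs hint at why the workload is heavy.

        import numpy as np
        from scipy.ndimage import uniform_filter

        def adaptive_filter_3d(vol, size=3, noise_var=0.01):
            """Lee-type adaptive smoothing: blend each voxel toward its local
            mean, smoothing flat regions strongly and edges weakly."""
            mean = uniform_filter(vol, size)
            var = uniform_filter(vol * vol, size) - mean * mean
            gain = np.clip((var - noise_var) / np.maximum(var, 1e-12), 0.0, 1.0)
            return mean + gain * (vol - mean)

        vol = np.random.rand(64, 64, 32).astype(np.float32)
        out = adaptive_filter_3d(vol)
        print(out.shape)  # (64, 64, 32)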

  15. Floating-Point Modules Targeted for Use with RC Compilation Tools

    NASA Technical Reports Server (NTRS)

    Sahin, Ibrahin; Gloster, Clay S.

    2000-01-01

    Reconfigurable Computing (RC) has emerged as a viable computing solution for computationally intensive applications. Several applications have been mapped to RC systems, and in most cases they provided the smallest published execution time. Although RC systems offer significant performance advantages over general-purpose processors, they require more application development time than general-purpose processors. This increased development time of RC systems provides the motivation to develop an optimized module library with an assembly language instruction format interface for use with future RC systems, which will reduce development time significantly. In this paper, we present area/performance metrics for several different types of floating-point (FP) modules that can be utilized to develop complex FP applications. These modules are highly pipelined and optimized for both speed and area. Using these modules, an example application, FP matrix multiplication, is also presented. Our results and experiences show that, with these modules, an 8-10X speedup over general-purpose processors can be achieved.

  16. NCC Simulation Model: Simulating the operations of the network control center, phase 2

    NASA Technical Reports Server (NTRS)

    Benjamin, Norman M.; Paul, Arthur S.; Gill, Tepper L.

    1992-01-01

    The simulation of the network control center (NCC) is in the second phase of development. This phase seeks to further develop the work performed in phase one. Phase one concentrated on the computer systems and interconnecting network. The focus of phase two will be the implementation of the network message dialogues and the resources controlled by the NCC. These resources are requested, initiated, monitored, and analyzed via network messages. In the NCC, network messages are presented in the form of packets that are routed across the network. These packets are generated, encoded, decoded, and processed by the network host processors that generate and service the message traffic on the network that connects these hosts. As a result, the message traffic is used to characterize the work done by the NCC and the connected network. Phase one of the model development represented the NCC as a network of bi-directional single-server queues and message-generating sources. The generators represented the external segment processors. The server-based queues represented the host processors. The NCC model consists of the internal and external processors which generate message traffic on the network that links these hosts. To fully realize the objective of phase two, it is necessary to identify and model the processes in each internal processor. These processes live in the operating system of the internal host computers and handle tasks such as high-speed message exchanging, ISN and NFE interfacing, event monitoring, network monitoring, and message logging. Inter-process communication is achieved through the operating system facilities. The overall performance of the host is determined by its ability to service messages generated by both internal and external processors.

  17. Shuttle orbiter S-band payload communications equipment design evaluation

    NASA Technical Reports Server (NTRS)

    Springett, J. C.; Maronde, R. G.

    1979-01-01

    Analysis of the design and assessment of the performance of the Orbiter S-band communication equipment are reported. The equipment considered includes the network transponder, network signal processor, FM transmitter, FM signal processor, payload interrogator, and payload signal processor.

  18. A case study for the real-time experimental evaluation of the VIPER microprocessor

    NASA Astrophysics Data System (ADS)

    Carreno, Victor A.; Angellatta, Rob K.

    1991-09-01

    An experiment to evaluate the applicability of the Verifiable Integrated Processor for Enhanced Reliability (VIPER) microprocessor to real time control is described. The VIPER microprocessor was invented by the Royal Signals and Radar Establishment (RSRE), U.K., and is an example of the use of formal mathematical methods for developing electronic digital systems with a high degree of assurance on the system design and implementation correctness. The experiment consisted of selecting a control law, writing the control law algorithm for the VIPER processor, and providing real time, dynamic inputs into the processor and monitoring the outputs. The control law selected and coded for the VIPER processor was the yaw damper function of an automatic landing program for a 737 aircraft. The mechanisms for interfacing the VIPER Single Board Computer to the VAX host are described. Results include run time experiences, performance evaluation, and comparison of VIPER and FORTRAN yaw damper algorithm output for accuracy estimation.

  19. A case study for the real-time experimental evaluation of the VIPER microprocessor

    NASA Technical Reports Server (NTRS)

    Carreno, Victor A.; Angellatta, Rob K.

    1991-01-01

    An experiment to evaluate the applicability of the Verifiable Integrated Processor for Enhanced Reliability (VIPER) microprocessor to real time control is described. The VIPER microprocessor was invented by the Royal Signals and Radar Establishment (RSRE), U.K., and is an example of the use of formal mathematical methods for developing electronic digital systems with a high degree of assurance on the system design and implementation correctness. The experiment consisted of selecting a control law, writing the control law algorithm for the VIPER processor, and providing real time, dynamic inputs into the processor and monitoring the outputs. The control law selected and coded for the VIPER processor was the yaw damper function of an automatic landing program for a 737 aircraft. The mechanisms for interfacing the VIPER Single Board Computer to the VAX host are described. Results include run time experiences, performance evaluation, and comparison of VIPER and FORTRAN yaw damper algorithm output for accuracy estimation.

  20. Real-Time Spatio-Temporal Twice Whitening for MIMO Energy Detector

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Humble, Travis S; Mitra, Pramita; Barhen, Jacob

    2010-01-01

    While many techniques exist for local spectrum sensing of a primary user, each represents a computationally demanding task to secondary user receivers. In software-defined radio, computational complexity lengthens the time for a cognitive radio to recognize changes in the transmission environment. This complexity is even more significant for spatially multiplexed receivers, e.g., in SIMO and MIMO, where the spatio-temporal data sets grow in size with the number of antennae. Limits on power and space for the processor hardware further constrain SDR performance. In this report, we discuss improvements in spatio-temporal twice whitening (STTW) for real-time local spectrum sensing by demonstrating a form of STTW well suited for MIMO environments. We implement STTW on the Coherent Logix hx3100 processor, a multicore processor intended for low-power, high-throughput software-defined signal processing. These results demonstrate how coupling the novel capabilities of emerging multicore processors with algorithmic advances can enable real-time, software-defined processing of large spatio-temporal data sets.

  1. Computer-Aided Techniques for Providing Operator Performance Measures.

    ERIC Educational Resources Information Center

    Connelly, Edward M.; And Others

    This report documents the theory, structure, and implementation of a performance processor (written in FORTRAN IV) that can accept performance demonstration data representing various levels of operator's skill and, under user control, analyze data to provide candidate performance measures and validation test results. The processor accepts two…

  2. ACE: Automatic Centroid Extractor for real time target tracking

    NASA Technical Reports Server (NTRS)

    Cameron, K.; Whitaker, S.; Canaris, J.

    1990-01-01

    A high performance video image processor has been implemented which is capable of grouping contiguous pixels from a raster scan image into objects and then calculating centroid information for each object in a frame. The algorithm employed to group pixels is very efficient and is guaranteed to work properly for all convex shapes as well as most concave shapes. Processing speeds are adequate for real time processing of video images having a pixel rate of up to 20 million pixels per second. Pixels may be up to 8 bits wide. The processor is designed to interface directly to a transputer serial link communications channel with no additional hardware. The full custom VLSI processor was implemented in a 1.6 µm CMOS process and measures 7200 µm on a side.
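
    The ACE chip's actual pixel-grouping logic is not detailed in this record. As a software illustration only, the following Python sketch (function name and threshold parameter are ours) flood-fills contiguous above-threshold pixels and returns intensity-weighted centroids, which is the same grouping-then-moments computation the abstract describes:

        from collections import deque

        def object_centroids(frame, threshold=1):
            """Group contiguous above-threshold pixels (4-connectivity) and
            return the intensity-weighted centroid of each object."""
            rows, cols = len(frame), len(frame[0])
            seen = [[False] * cols for _ in range(rows)]
            centroids = []
            for r0 in range(rows):
                for c0 in range(cols):
                    if seen[r0][c0] or frame[r0][c0] < threshold:
                        continue
                    # Flood-fill one object, accumulating zeroth and first moments.
                    queue, mass, mr, mc = deque([(r0, c0)]), 0.0, 0.0, 0.0
                    seen[r0][c0] = True
                    while queue:
                        r, c = queue.popleft()
                        w = frame[r][c]
                        mass += w; mr += w * r; mc += w * c
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                            nr, nc = r + dr, c + dc
                            if (0 <= nr < rows and 0 <= nc < cols
                                    and not seen[nr][nc] and frame[nr][nc] >= threshold):
                                seen[nr][nc] = True
                                queue.append((nr, nc))
                    centroids.append((mr / mass, mc / mass))
            return centroids

        frame = [[0, 0, 0, 0],
                 [0, 9, 9, 0],
                 [0, 9, 9, 0],
                 [0, 0, 0, 0]]
        print(object_centroids(frame))  # [(1.5, 1.5)]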

  3. The development of a power spectral density processor for C and L band airborne radar scatterometer sensor systems

    NASA Technical Reports Server (NTRS)

    Harrison, D. A., III; Chladek, J. T.

    1983-01-01

    A real-time signal processor was developed for the NASA/JSC L- and C-band airborne radar scatterometer sensor systems. The purpose of the effort was to reduce ground data processing costs. Two quadrature channels of data (like- and cross-polarized) were converted to obtain Power Spectral Density (PSD) values. A chirp-z transform (CZT) approach was used to filter the Doppler return signal, and improved high-frequency and angular resolution was realized. The processors have been tested with recorded signals and excellent results were obtained. CZT filtering can be readily applied to scatterometers operating at other wavelengths by altering the sample frequency. The design of the hardware and software and the results of the performance tests are described in detail.
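
    The record does not give the hardware's CZT realization; below is a minimal NumPy sketch of Bluestein's algorithm, the standard software formulation of the chirp-z transform (function name and parameters are ours). The follow-up check shows it reduces to the DFT for the usual choice of contour:

        import numpy as np

        def czt(x, m, w, a):
            """Chirp-z transform via Bluestein's algorithm:
            X[k] = sum_n x[n] * (a * w**-k)**(-n),  k = 0..m-1."""
            n = len(x)
            k = np.arange(max(m, n))
            chirp = w ** (k**2 / 2.0)                      # w^(k^2/2)
            nfft = 1 << int(np.ceil(np.log2(n + m - 1)))   # linear convolution size
            xp = np.zeros(nfft, dtype=complex)
            xp[:n] = x * a ** (-np.arange(n)) * chirp[:n]
            v = np.zeros(nfft, dtype=complex)
            v[:m] = 1.0 / chirp[:m]                        # indices 0..m-1
            v[nfft - n + 1:] = 1.0 / chirp[1:n][::-1]      # negative indices wrap
            X = np.fft.ifft(np.fft.fft(xp) * np.fft.fft(v))[:m]
            return X * chirp[:m]

        x = np.random.randn(64)
        out = czt(x, 64, np.exp(-2j * np.pi / 64), 1.0)
        print(np.allclose(out, np.fft.fft(x)))  # True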

  4. Active non-volatile memory post-processing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kannan, Sudarsun; Milojicic, Dejan S.; Talwar, Vanish

    A computing node includes an active Non-Volatile Random Access Memory (NVRAM) component which includes memory and a sub-processor component. The memory is to store data chunks received from a processor core, the data chunks comprising metadata indicating a type of post-processing to be performed on data within the data chunks. The sub-processor component is to perform post-processing of said data chunks based on said metadata.
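
    The patent names no specific post-processing operations. A minimal sketch (all names and operations ours) of the idea that each chunk's metadata selects what the NVRAM sub-processor does to it, off the critical path of the host core:

        from dataclasses import dataclass

        # Hypothetical post-processing kinds a chunk's metadata might name.
        POST_OPS = {
            "checksum": lambda data: sum(data) & 0xFFFFFFFF,
            "sort":     lambda data: sorted(data),
            "dedup":    lambda data: list(dict.fromkeys(data)),
        }

        @dataclass
        class Chunk:
            metadata: str          # names the post-processing to perform
            data: list

        def subprocessor_service(chunks):
            """Model of the sub-processor applying the post-processing named
            in each chunk's metadata."""
            return [POST_OPS[c.metadata](c.data) for c in chunks]

        print(subprocessor_service([Chunk("checksum", [1, 2, 3]),
                                    Chunk("dedup", [4, 4, 5])]))
        # [6, [4, 5]]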

  5. Spiking neural networks on high performance computer clusters

    NASA Astrophysics Data System (ADS)

    Chen, Chong; Taha, Tarek M.

    2011-09-01

    In this paper we examine the acceleration of two spiking neural network models on three clusters of multicore processors representing three categories of processors: x86, STI Cell, and NVIDIA GPGPUs. The x86 cluster utilized consists of 352 dual-core AMD Opterons, the Cell cluster consists of 320 Sony PlayStation 3s, while the GPGPU cluster contains 32 NVIDIA Tesla S1070 systems. The results indicate that the GPGPU platform can dominate in performance compared to the Cell and x86 platforms examined. From a cost perspective, however, the GPGPU is more expensive in terms of cost per unit of neuron/s throughput. If the cost of GPGPUs goes down in the future, this platform will become very cost effective for these models.
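
    The abstract does not specify the two network models. For orientation only, a generic leaky integrate-and-fire layer update in NumPy (all names and constants ours) shows the per-timestep arithmetic that such clusters parallelize across neurons:

        import numpy as np

        def lif_step(v, spikes_in, weights, dt=1e-3, tau=0.02,
                     v_rest=0.0, v_thresh=1.0):
            """One Euler step of a leaky integrate-and-fire layer:
            dv/dt = (v_rest - v)/tau + W @ s_in; spike and reset at threshold."""
            v = v + dt * ((v_rest - v) / tau) + weights @ spikes_in
            fired = v >= v_thresh
            v[fired] = v_rest                      # reset after spiking
            return v, fired

        rng = np.random.default_rng(0)
        v = np.zeros(100)                          # 100 postsynaptic neurons
        w = rng.uniform(0.0, 0.3, size=(100, 50))  # 50 input channels
        for _ in range(10):
            v, fired = lif_step(v, rng.integers(0, 2, 50), w)
        print(int(fired.sum()), "neurons fired on the last step")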

  6. Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Bolosky, William Joseph

    1993-01-01

    Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software-maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. The thesis finds that, in properly built systems, software-maintained coherence can perform comparably to, or even better than, hardware-maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.

  7. Green Secure Processors: Towards Power-Efficient Secure Processor Design

    NASA Astrophysics Data System (ADS)

    Chhabra, Siddhartha; Solihin, Yan

    With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.

  8. Hiding the Disk and Network Latency of Out-of-Core Visualization

    NASA Technical Reports Server (NTRS)

    Ellsworth, David

    2001-01-01

    This paper describes an algorithm that improves the performance of application-controlled demand paging for out-of-core visualization by hiding the latency of reading data from both local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and by performing multiple page reads in parallel. The paper includes measurements that show that the new multithreaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by two thirds. Visualization runs using data from remote disk actually ran faster than ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.
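
    A minimal Python sketch of the prefetching idea (all names ours; `read_page` is a stand-in for local-disk or remote-server reads, not the paper's implementation): a small window of page reads is kept in flight so the visualization computation overlaps I/O latency, and multiple reads proceed in parallel.

        import time
        from concurrent.futures import ThreadPoolExecutor

        def read_page(page_id):
            # Stand-in for a slow local-disk or remote-server page read.
            time.sleep(0.01)
            return bytes(64)

        def compute_on(data):
            pass  # placeholder for the visualization work on one page

        def visualize(pages, lookahead=4):
            """Demand paging with prefetch: keep the next `lookahead` page
            reads in flight so computation overlaps read latency."""
            with ThreadPoolExecutor(max_workers=lookahead) as pool:
                futures = {p: pool.submit(read_page, p) for p in pages[:lookahead]}
                for i, p in enumerate(pages):
                    data = futures.pop(p).result()  # blocks only if read not done
                    if i + lookahead < len(pages):  # refill the prefetch window
                        nxt = pages[i + lookahead]
                        futures[nxt] = pool.submit(read_page, nxt)
                    compute_on(data)

        visualize(list(range(16)))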

  9. The ATLAS Level-1 Calorimeter Trigger: PreProcessor implementation and performance

    NASA Astrophysics Data System (ADS)

    Åsman, B.; Achenbach, R.; Allbrooke, B. M. M.; Anders, G.; Andrei, V.; Büscher, V.; Bansil, H. S.; Barnett, B. M.; Bauss, B.; Bendtz, K.; Bohm, C.; Bracinik, J.; Brawn, I. P.; Brock, R.; Buttinger, W.; Caputo, R.; Caughron, S.; Cerrito, L.; Charlton, D. G.; Childers, J. T.; Curtis, C. J.; Daniells, A. C.; Davis, A. O.; Davygora, Y.; Dorn, M.; Eckweiler, S.; Edmunds, D.; Edwards, J. P.; Eisenhandler, E.; Ellis, K.; Ermoline, Y.; Föhlisch, F.; Faulkner, P. J. W.; Fedorko, W.; Fleckner, J.; French, S. T.; Gee, C. N. P.; Gillman, A. R.; Goeringer, C.; Hülsing, T.; Hadley, D. R.; Hanke, P.; Hauser, R.; Heim, S.; Hellman, S.; Hickling, R. S.; Hidvégi, A.; Hillier, S. J.; Hofmann, J. I.; Hristova, I.; Ji, W.; Johansen, M.; Keller, M.; Khomich, A.; Kluge, E.-E.; Koll, J.; Laier, H.; Landon, M. P. J.; Lang, V. S.; Laurens, P.; Lepold, F.; Lilley, J. N.; Linnemann, J. T.; Müller, F.; Müller, T.; Mahboubi, K.; Martin, T. A.; Mass, A.; Meier, K.; Meyer, C.; Middleton, R. P.; Moa, T.; Moritz, S.; Morris, J. D.; Mudd, R. D.; Narayan, R.; zur Nedden, M.; Neusiedl, A.; Newman, P. R.; Nikiforov, A.; Ohm, C. C.; Perera, V. J. O.; Pfeiffer, U.; Plucinski, P.; Poddar, S.; Prieur, D. P. F.; Qian, W.; Rieck, P.; Rizvi, E.; Sankey, D. P. C.; Schäfer, U.; Scharf, V.; Schmitt, K.; Schröder, C.; Schultz-Coulon, H.-C.; Schumacher, C.; Schwienhorst, R.; Silverstein, S. B.; Simioni, E.; Snidero, G.; Staley, R. J.; Stamen, R.; Stock, P.; Stockton, M. C.; Tan, C. L. A.; Tapprogge, S.; Thomas, J. P.; Thompson, P. D.; Thomson, M.; True, P.; Watkins, P. M.; Watson, A. T.; Watson, M. F.; Weber, P.; Wessels, M.; Wiglesworth, C.; Williams, S. L.

    2012-12-01

    The PreProcessor system of the ATLAS Level-1 Calorimeter Trigger (L1Calo) receives about 7200 analogue signals from the electromagnetic and hadronic components of the calorimetric detector system. Lateral division results in cells which are pre-summed to so-called Trigger Towers of size 0.1 × 0.1 along azimuth (φ) and pseudorapidity (η). The received calorimeter signals represent deposits of transverse energy. The system consists of 124 individual PreProcessor modules that digitise the input signals for each LHC collision, and provide energy and timing information to the digital processors of the L1Calo system, which identify physics objects forming much of the basis for the full ATLAS first-level trigger decision. This paper describes the architecture of the PreProcessor, its hardware realisation, functionality, and performance.

  10. Performance of High-Reliability Space-Qualified Processors Implementing Software Defined Radios

    DTIC Science & Technology

    2014-03-01

    Radiation in space poses a considerable threat to modern microelectronic devices, in particular to the high-performance low-cost computing…

  11. Supercomputing with TOUGH2 family codes for coupled multi-physics simulations of geologic carbon sequestration

    NASA Astrophysics Data System (ADS)

    Yamamoto, H.; Nakajima, K.; Zhang, K.; Nanai, S.

    2015-12-01

    Powerful numerical codes that are capable of modeling complex coupled processes of physics and chemistry have been developed for predicting the fate of CO2 in reservoirs as well as its potential impacts on groundwater and subsurface environments. However, they are often computationally demanding for solving highly non-linear models in sufficient spatial and temporal resolutions. Geological heterogeneity and uncertainties further increase the challenges in modeling works. Two-phase flow simulations in heterogeneous media usually require much longer computational time than those in homogeneous media. Uncertainties in reservoir properties may necessitate stochastic simulations with multiple realizations. Recently, massively parallel supercomputers with more than thousands of processors have become available in scientific and engineering communities. Such supercomputers may attract attention from geoscientists and reservoir engineers for solving large and non-linear models in higher resolutions within a reasonable time. However, for making them a useful tool, it is essential to tackle several practical obstacles to utilizing large numbers of processors effectively for general-purpose reservoir simulators. We have implemented massively-parallel versions of two TOUGH2 family codes (the multi-phase flow simulator TOUGH2 and the chemically reactive transport simulator TOUGHREACT) on two different types (vector- and scalar-type) of supercomputers with a thousand to tens of thousands of processors. After completing implementation and extensive tune-up on the supercomputers, the computational performance was measured for three simulations with multi-million grid models, including a simulation of the dissolution-diffusion-convection process that requires high spatial and temporal resolutions to simulate the growth of small convective fingers of CO2-dissolved water into larger ones at reservoir scale. The performance measurements confirmed that both simulators exhibit excellent scalability, showing almost linear speedup against the number of processors up to over ten thousand cores. Generally this allows us to perform coupled multi-physics (THC) simulations on high-resolution geologic models with multi-million-cell grids in a practical time (e.g., less than a second per time step).
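
    For reference, the scalability claim can be stated precisely. Writing $T_1$ for the single-core runtime and $T_p$ for the runtime on $p$ cores (notation ours, not from the record), speedup and parallel efficiency are

        S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p},

    and "almost linear speedup up to over ten thousand cores" is the statement that $E(p)$ remains close to 1 over that range.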

  12. A Low-Power High-Speed Smart Sensor Design for Space Exploration Missions

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi

    1997-01-01

    A low-power high-speed smart sensor system based on a large-format active pixel sensor (APS) integrated with a programmable neural processor for space exploration missions is presented. The concept of building an advanced smart sensing system is demonstrated by a system-level microchip design that is composed of an APS sensor, a programmable neural processor, and an embedded microprocessor in a SOI CMOS technology. This ultra-fast smart sensor system-on-a-chip design mimics what is inherent in biological vision systems. Moreover, it is programmable and capable of performing ultra-fast machine vision processing at all levels, such as image acquisition, image fusion, image analysis, scene interpretation, and control functions. The system provides about one tera-operation-per-second computing power, a two-order-of-magnitude increase over that of state-of-the-art microcomputers. Its high performance is due to massively parallel computing structures, high data throughput rates, fast learning capabilities, and advanced VLSI system-on-a-chip implementation.

  13. Mapping of H.264 decoding on a multiprocessor architecture

    NASA Astrophysics Data System (ADS)

    van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.

    2003-05-01

    Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs that are also scalable over advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of, e.g., a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling, e.g., SD-resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system executing the same software. Experimental results show that data communication is reduced considerably, by up to 65%, directly improving the overall performance. Apart from the considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the overall speedup.
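
    A toy sketch of data partitioning (ours, not the paper's decoder): the frame is split into horizontal bands, one per processor, so each worker touches only its own rows. Real H.264 decoding adds dependencies across band edges (e.g., deblocking), which this sketch ignores.

        from concurrent.futures import ThreadPoolExecutor

        def process_slice(rows):
            # Stand-in for decoding work on one horizontal band of the frame;
            # each worker touches only its own rows (locality of data).
            return [[px ^ 0xFF for px in row] for row in rows]

        def decode_frame(frame, n_workers=2):
            """Data partitioning: split the frame into n_workers horizontal
            bands and process them concurrently, as opposed to giving each
            worker a different pipeline stage (functional partitioning)."""
            h = len(frame)
            bands = [frame[i * h // n_workers:(i + 1) * h // n_workers]
                     for i in range(n_workers)]
            with ThreadPoolExecutor(max_workers=n_workers) as pool:
                out = pool.map(process_slice, bands)
            return [row for band in out for row in band]

        frame = [[r * 16 + c for c in range(8)] for r in range(8)]
        print(len(decode_frame(frame)))  # 8 rows back, processed in 2 bands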

  14. Measurement of fault latency in a digital avionic mini processor, part 2

    NASA Technical Reports Server (NTRS)

    Mcgough, J.; Swern, F.

    1983-01-01

    The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are described. Several earlier programs were reprogrammed, expanding the instruction set to capitalize on the full power of the BDX-930 computer. As a final demonstration of fault coverage, an extensive 3-axis, high-performance flight control computation was added. The stages in the development of a CPU self-test program were demonstrated, emphasizing the relationship between fault coverage, speed, and quantity of instructions.

  15. Polymorphous computing fabric

    DOEpatents

    Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM

    2011-01-18

    Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moreland, Kenneth; Sewell, Christopher; Usher, William

    Here, one of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.

  17. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Moreland, Kenneth; Sewell, Christopher; Usher, William

    Execution on massively threaded processors is one of the most critical challenges for high-performance computing (HPC) scientific visualization. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Moreover, our current production scientific visualization software is not designed for these new types of architectures. In order to address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture.

  18. Systolic Processor Array For Recognition Of Spectra

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Peterson, John C.

    1995-01-01

    Spectral signatures of materials can be detected and identified quickly. The Spectral Analysis Systolic Processor Array (SPA2) is relatively inexpensive and satisfies the need to analyze the large, complex volume of multispectral data generated by imaging spectrometers to extract desired information; the computational performance needed to do this in real time exceeds that of current supercomputers. SPA2 locates highly similar segments or contiguous subsegments in two different spectra at a time. It compares sampled spectra from instruments with a database of spectral signatures of known materials, and computes and reports scores that express degrees of similarity between sampled and database spectra.
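
    The systolic comparison scheme itself is not given in this record. A scalar Python sketch (function names and the example signatures are ours) of the database-scoring step, using a normalized inner product as the similarity score:

        import numpy as np

        def best_match(sample, library):
            """Score a sampled spectrum against each library signature with a
            normalized inner product and report the best-scoring material."""
            scores = {}
            for name, ref in library.items():
                scores[name] = float(np.dot(sample, ref) /
                                     (np.linalg.norm(sample) * np.linalg.norm(ref)))
            return max(scores, key=scores.get), scores

        wl = np.linspace(0.4, 2.5, 224)                  # a 224-band spectrum
        library = {"mineral_a": np.exp(-(wl - 2.2)**2 / 0.01),
                   "water":     np.exp(-(wl - 1.4)**2 / 0.02)}
        sample = library["water"] + 0.05 * np.random.rand(224)
        print(best_match(sample, library)[0])            # water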

  19. Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-Time Heterogeneous Multicores

    NASA Astrophysics Data System (ADS)

    Hayashi, Akihiro; Wada, Yasutaka; Watanabe, Takeshi; Sekiguchi, Takeshi; Mase, Masayoshi; Shirako, Jun; Kimura, Keiji; Kasahara, Hironori

    Heterogeneous multicores have been attracting much attention for attaining high performance while keeping power consumption low in a wide spread of areas. However, heterogeneous multicores impose very difficult programming on programmers, and the long application development period lowers product competitiveness. In order to overcome such a situation, this paper proposes a compilation framework which bridges the gap between programmers and heterogeneous multicores. In particular, this paper describes a compilation framework based on the OSCAR compiler. It realizes coarse-grain task parallel processing, data transfer using a DMA controller, and power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates the processing performance and power reduction delivered by the proposed framework on a newly developed 15-core heterogeneous multicore chip named RP-X, integrating 8 general-purpose processor cores and 3 types of accelerator cores, which was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups of up to 32x for an optical flow program with eight general-purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core, and an 80% power reduction for real-time AAC encoding.

  20. Performance Modeling of the ADA Rendezvous

    DTIC Science & Technology

    1991-10-01

    queueing network of Figure 2, SERVERTASK can complete only one rendezvous at a time. Thus, the rate that the rendezvous requests are processed at the... In Network 1, SERVERTASK competes with the traffic tasks of the Server Processor. Each time SERVERTASK gains access to the processor, SERVERTASK completes... [Figure 10: A conceptualization of the algorithm, showing the Client Processor, Server Processor, and Software Server of Networks 1 and 2.] The SERVERTASK software server of Network 2

  1. Multicore Challenges and Benefits for High Performance Scientific Computing

    DOE PAGES

    Nielsen, Ida M. B.; Janssen, Curtis L.

    2008-01-01

    Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction-level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed-memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
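
    A minimal threaded sketch (ours, not the paper's code) of the data decomposition behind such a hybrid matrix multiply: row blocks go to threads here, and on a cluster the same row blocks would first be distributed over message-passing ranks, with threads inside each rank. NumPy's matmul releases the GIL, so plain threads give real concurrency in this sketch.

        import numpy as np
        from concurrent.futures import ThreadPoolExecutor

        def blocked_matmul(a, b, block=64, workers=4):
            """Compute A @ B by assigning row blocks of A to worker threads."""
            c = np.zeros((a.shape[0], b.shape[1]))
            def do_block(i):
                c[i:i + block] = a[i:i + block] @ b
            with ThreadPoolExecutor(max_workers=workers) as pool:
                list(pool.map(do_block, range(0, a.shape[0], block)))
            return c

        a, b = np.random.rand(256, 256), np.random.rand(256, 256)
        print(np.allclose(blocked_matmul(a, b), a @ b))  # True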

  2. System-on-chip architecture and validation for real-time transceiver optimization: APC implementation on FPGA

    NASA Astrophysics Data System (ADS)

    Suarez, Hernan; Zhang, Yan R.

    2015-05-01

    New radar applications need to perform complex algorithms and process large quantities of data to generate useful information for the users. This situation has motivated the search for better processing solutions that include low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression for real-time transceiver optimization is presented; it is based on a System-on-Chip architecture for Xilinx devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve the computation of computing-intensive tasks such as matrix multiplication and matrix inversion, which are essential units for solving the covariance matrix. The tradeoffs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance AXI buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operations using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
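
    The record does not give the adaptive pulse compression equations, which iteratively refine the range-profile estimate. Shown below is only the classical matched-filter baseline that APC improves upon, as a NumPy sketch (waveform, sizes, and names are ours):

        import numpy as np

        def pulse_compress(rx, tx):
            """Classical (non-adaptive) pulse compression: cross-correlate the
            received signal with the transmitted waveform via the FFT."""
            n = len(rx) + len(tx) - 1
            nfft = 1 << (n - 1).bit_length()
            return np.fft.ifft(np.fft.fft(rx, nfft) *
                               np.conj(np.fft.fft(tx, nfft)))[:n]

        # Linear FM (chirp) waveform and an echo buried in noise.
        t = np.arange(128)
        tx = np.exp(1j * np.pi * 0.002 * t**2)
        rx = np.zeros(512, dtype=complex)
        rx[200:328] += tx                         # target echo at delay 200
        rx += 0.1 * np.random.randn(512)
        peak = int(np.abs(pulse_compress(rx, tx)).argmax())
        print(peak)  # 200, the echo delay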

  3. Fault-Tolerant Software-Defined Radio on Manycore

    NASA Technical Reports Server (NTRS)

    Ricketts, Scott

    2015-01-01

    Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiationhardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload-reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.

  4. Self-checking self-repairing computer nodes using the mirror processor

    NASA Technical Reports Server (NTRS)

    Tamir, Yuval

    1992-01-01

    Circuitry added to fault-tolerant systems for concurrent error detection usually reduces performance. Using a technique called micro rollback, it is possible to eliminate most of the performance penalty of concurrent error detection. Error detection is performed in parallel with intermodule communication, and erroneous state changes are later undone. The author reports on the design and implementation of a VLSI RISC microprocessor, called the Mirror Processor (MP), which is capable of micro rollback. In order to achieve concurrent error detection, two MP chips operate in lockstep, comparing external signals and a signature of internal signals every clock cycle. If a mismatch is detected, both processors roll back to the beginning of the cycle in which the error occurred. In some cases the erroneous state is corrected by copying a value from the fault-free processor to the faulty processor. The architecture, microarchitecture, and VLSI implementation of the MP, emphasizing its error-detection, error-recovery, and self-diagnosis capabilities, are described.
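
    A toy Python model (ours, far simpler than the MP hardware) of the per-cycle compare-and-rollback idea: execute one cycle on both copies, compare, and on mismatch undo the cycle and retry, so a transient fault disappears on re-execution:

        def run_cycle(state_a, state_b, op, inject=False):
            """One lockstep cycle with micro rollback: execute, compare, and on
            mismatch undo the cycle and re-execute."""
            ckpt_a, ckpt_b = dict(state_a), dict(state_b)
            op(state_a); op(state_b)
            if inject:
                state_b["acc"] ^= 1                       # transient bit flip in B
            if state_a != state_b:                        # per-cycle comparison
                state_a.clear(); state_a.update(ckpt_a)   # roll back one cycle
                state_b.clear(); state_b.update(ckpt_b)
                op(state_a); op(state_b)                  # retry; transient is gone
            return state_a, state_b

        a, b = {"acc": 0}, {"acc": 0}
        step = lambda s: s.update(acc=s["acc"] + 1)
        run_cycle(a, b, step, inject=True)
        print(a, b)  # {'acc': 1} {'acc': 1} -- mismatch detected and undone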

  5. Soft-core processor study for node-based architectures.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James

    2008-09-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hardcore processor built into the FPGA or as a soft-core processor builtmore » out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.« less

  6. The ISS Water Processor Catalytic Reactor as a Post Processor for Advanced Water Reclamation Systems

    NASA Technical Reports Server (NTRS)

    Nalette, Tim; Snowdon, Doug; Pickering, Karen D.; Callahan, Michael

    2007-01-01

    Advanced water processors being developed for NASA's Exploration Initiative rely on phase-change technologies and/or biological processes as the primary means of water reclamation. As a result of the phase change, volatile compounds will also be transported into the distillate product stream. The catalytic reactor assembly used in the International Space Station (ISS) water processor assembly, referred to as the Volatile Removal Assembly (VRA), has demonstrated high-efficiency oxidation of many of these volatile contaminants, such as low-molecular-weight alcohols and acetic acid, and is considered a viable post-treatment system for all advanced water processors. To support this investigation, two ersatz solutions were defined for further evaluation of the VRA. The first solution was developed as part of an internal research and development project at Hamilton Sundstrand (HS) and is based primarily on ISS experience related to the development of the VRA. The second ersatz solution was defined by NASA in support of a study contract to Hamilton Sundstrand to evaluate the VRA as a potential post-processor for the Cascade Distillation system being developed by Honeywell. This second ersatz solution contains several low-molecular-weight alcohols, organic acids, and several inorganic species. A range of residence times, oxygen concentrations, and operating temperatures has been studied with both ersatz solutions to provide additional data on the performance capability of the VRA catalyst.

  7. Development of a software interface for optical disk archival storage for a new life sciences flight experiments computer

    NASA Technical Reports Server (NTRS)

    Bartram, Peter N.

    1989-01-01

    The current Life Sciences Laboratory Equipment (LSLE) microcomputer for life sciences experiment data acquisition is now obsolete. Among the weaknesses of the current microcomputer are small memory size, relatively slow analog data sampling rates, and the lack of a bulk data storage device. While life science investigators normally prefer data to be transmitted to Earth as it is taken, this is not always possible. No down-link exists for experiments performed in the Shuttle middeck region. One important aspect of a replacement microcomputer is provision for in-flight storage of experimental data. The Write Once, Read Many (WORM) optical disk was studied because of its high storage density, data integrity, and the availability of a space-qualified unit. In keeping with the goals for a replacement microcomputer based upon commercially available components and standard interfaces, the system studied includes a Small Computer System Interface (SCSI) for interfacing the WORM drive. The system itself is designed around the STD bus, using readily available boards. Configurations examined were: (1) master processor board and slave processor board with the SCSI interface; (2) master processor with SCSI interface; (3) master processor with SCSI and Direct Memory Access (DMA); (4) master processor controlling a separate STD bus SCSI board; and (5) master processor controlling a separate STD bus SCSI board with DMA.

  8. Accelerating molecular dynamic simulation on the cell processor and Playstation 3.

    PubMed

    Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S

    2009-01-30

    Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase their power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in the PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.

  9. Allocating application to group of consecutive processors in fault-tolerant deadlock-free routing path defined by routers obeying same rules for path selection

    DOEpatents

    Leung, Vitus J [Albuquerque, NM; Phillips, Cynthia A [Albuquerque, NM; Bender, Michael A [East Northport, NY; Bunde, David P [Urbana, IL

    2009-07-21

    In a multiple processor computing apparatus, directional routing restrictions and a logical channel construct permit fault tolerant, deadlock-free routing. Processor allocation can be performed by creating a linear ordering of the processors based on routing rules used for routing communications between the processors. The linear ordering can assume a loop configuration, and bin-packing is applied to this loop configuration. The interconnection of the processors can be conceptualized as a generally rectangular 3-dimensional grid, and the MC allocation algorithm is applied with respect to the 3-dimensional grid.
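
    As a rough sketch of the allocation idea (not the patented MC algorithm, which also encodes the 3-D grid structure and fault-tolerant routing rules), first-fit packing of a job onto consecutive processors along a loop-shaped linear ordering might look like this in Python; all names are illustrative:

      def allocate(job_size, free, order):
          # Allocate a job to consecutive processors along the routing-derived
          # linear ordering, treated as a loop, first-fit style. `order` is the
          # linear ordering of processor ids; `free` is the set of idle ones.
          n = len(order)
          if job_size > len(free):
              return None
          run = []
          for i in range(2 * n):            # walk the loop at most twice
              p = order[i % n]
              if p in free:
                  run.append(p)
                  if len(run) == job_size:
                      return run
              else:
                  run = []                  # consecutive-run requirement broken
          return None

      order = list(range(8))                # loop ordering 0..7, wrapping 7 -> 0
      print(allocate(3, {0, 1, 5, 6, 7}, order))   # [5, 6, 7]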

  10. A hierarchical, automated target recognition algorithm for a parallel analog processor

    NASA Technical Reports Server (NTRS)

    Woodward, Gail; Padgett, Curtis

    1997-01-01

    A hierarchical approach is described for an automated target recognition (ATR) system, VIGILANTE, that uses a massively parallel, analog processor (3DANN). The 3DANN processor is capable of performing 64 concurrent inner products of size 1x4096 every 250 nanoseconds.

  11. Database for LDV Signal Processor Performance Analysis

    NASA Technical Reports Server (NTRS)

    Baker, Glenn D.; Murphy, R. Jay; Meyers, James F.

    1989-01-01

    A comparative and quantitative analysis of various laser velocimeter signal processors is difficult because standards for characterizing signal bursts have not been established. This leaves the researcher to select a signal processor based only on manufacturers' claims without the benefit of direct comparison. The present paper proposes the use of a database of digitized signal bursts obtained from a laser velocimeter under various configurations as a method for directly comparing signal processors.
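
    A minimal Python sketch of what one record in such a database could look like: a Gaussian-enveloped Doppler burst in additive noise, generated at several signal-to-noise ratios. Parameter values and function names are assumptions for illustration, not values from the NASA database.

      import numpy as np

      def ldv_burst(fs=100e6, f_doppler=5e6, n=1024, snr_db=10.0, seed=0):
          # One Gaussian-enveloped Doppler burst plus white noise -- the kind
          # of record such a database would hold. All parameters illustrative.
          rng = np.random.default_rng(seed)
          t = np.arange(n) / fs
          t0 = t[n // 2]
          width = n / 8 / fs
          envelope = np.exp(-((t - t0) ** 2) / (2 * width ** 2))
          signal = envelope * np.cos(2 * np.pi * f_doppler * t)
          noise_var = signal.var() / 10 ** (snr_db / 10)
          return signal + rng.normal(0.0, np.sqrt(noise_var), n)

      # The database is then many digitized bursts spanning SNRs and
      # frequencies, replayed identically into each candidate processor:
      database = [ldv_burst(snr_db=s, f_doppler=f, seed=i)
                  for i, (s, f) in enumerate([(20, 2e6), (10, 5e6), (3, 8e6)])]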

  12. Multitask neurovision processor with extensive feedback and feedforward connections

    NASA Astrophysics Data System (ADS)

    Gupta, Madan M.; Knopf, George K.

    1991-11-01

    A multitask neuro-vision processor which performs a variety of information processing operations associated with the early stages of biological vision is presented. The network architecture of this neuro-vision processor, called the positive-negative (PN) neural processor, is loosely based on the neural activity fields exhibited by thalamic and cortical nervous tissue layers. The computational operation performed by the processor arises from the strength of the recurrent feedback among the numerous positive and negative neural computing units. By adjusting the feedback connections it is possible to generate diverse dynamic behavior that may be used for short-term visual memory (STVM), spatio-temporal filtering (STF), and pulse frequency modulation (PFM). The information attributes that are to be processed may be regulated by modifying the feedforward connections from the signal space to the neural processor.

  13. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    NASA Astrophysics Data System (ADS)

    Nash, Thomas

    1989-12-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event-parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXes. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point-to-point, rather than bussed, communication will be required. Developments in this direction are described.
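
    Because events are independent, event parallelism maps directly onto a pool of workers that each process whole events with no communication in between. A minimal Python sketch of a "processor farm" under that assumption (the toy per-event workload is invented):

      from multiprocessing import Pool

      def reconstruct(event):
          # Stand-in for per-event reconstruction (tracking, clustering, ...).
          # Events are independent, so workers need no communication.
          return sum(hit * hit for hit in event)

      if __name__ == "__main__":
          events = [[float(i + j) for j in range(100)] for i in range(10_000)]
          with Pool(processes=8) as farm:        # the "processor farm"
              results = farm.map(reconstruct, events, chunksize=64)
          print(len(results), "events reconstructed")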

  14. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-02-15

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.
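
    The kernel in question is Brandes' betweenness-centrality algorithm; the paper's contribution is a lock-free multithreaded variant with better cache locality, which a short sketch cannot reproduce, so a plain serial Brandes kernel is shown instead (Python, with an illustrative graph encoding):

      from collections import deque

      def betweenness(adj):
          # Serial Brandes algorithm for an unweighted graph given as
          # adjacency lists (node -> list of neighbors). The paper's
          # lock-free version parallelizes the outer loop over sources.
          bc = {v: 0.0 for v in adj}
          for s in adj:
              stack = []
              preds = {v: [] for v in adj}
              sigma = {v: 0 for v in adj}; sigma[s] = 1
              dist = {v: -1 for v in adj}; dist[s] = 0
              q = deque([s])
              while q:                               # BFS shortest-path counting
                  v = q.popleft(); stack.append(v)
                  for w in adj[v]:
                      if dist[w] < 0:
                          dist[w] = dist[v] + 1
                          q.append(w)
                      if dist[w] == dist[v] + 1:
                          sigma[w] += sigma[v]
                          preds[w].append(v)
              delta = {v: 0.0 for v in adj}
              while stack:                           # dependency accumulation
                  w = stack.pop()
                  for v in preds[w]:
                      delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
                  if w != s:
                      bc[w] += delta[w]
          return bc

      print(betweenness({0: [1], 1: [0, 2], 2: [1]}))  # node 1 lies on all paths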

  15. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-05-29

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.

  16. Multitasking 3-D forward modeling using high-order finite difference methods on the Cray X-MP/416

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Terki-Hassaine, O.; Leiss, E.L.

    1988-01-01

    The CRAY X-MP/416 was used to multitask 3-D forward modeling by the high-order finite difference method. Flowtrace analysis reveals that the most expensive operation in the unitasked program is a matrix-vector multiplication. The in-core and out-of-core versions of a reentrant subroutine can perform any fraction of the matrix-vector multiplication independently, a pattern compatible with multitasking. The matrix-vector multiplication routine can be distributed over two to four processors. The rest of the program utilizes the microtasking feature that lets the system treat independent iterations of DO-loops as subtasks to be performed by any available processor. The availability of the Solid-State Storage Device (SSD) meant the I/O wait time was virtually zero. A performance study determined a theoretical speedup, taking into account the multitasking overhead. Multitasking programs utilizing both macrotasking and microtasking features obtained actual speedups that were approximately 80% of the ideal speedup.
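
    The decomposition described, splitting one matrix-vector product into independent fractions handled by two to four processors, can be sketched in Python as a row-block split; the reentrant Fortran subroutine, macrotasking machinery, and SSD staging are not modeled:

      import numpy as np
      from concurrent.futures import ThreadPoolExecutor

      def parallel_matvec(a, x, tasks=4):
          # Split y = A x into row blocks computed by independent tasks,
          # echoing the reentrant matrix-vector routine distributed over
          # 2-4 X-MP processors. Threads suffice here because numpy's `@`
          # releases the GIL during the multiply.
          blocks = np.array_split(a, tasks, axis=0)
          with ThreadPoolExecutor(max_workers=tasks) as pool:
              parts = list(pool.map(lambda b: b @ x, blocks))
          return np.concatenate(parts)

      a = np.arange(12.0).reshape(4, 3)
      x = np.array([1.0, 2.0, 3.0])
      assert np.allclose(parallel_matvec(a, x, tasks=2), a @ x)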

  17. The CMS Level-1 Calorimeter Trigger for LHC Run II

    NASA Astrophysics Data System (ADS)

    Sinthuprasith, Tutanon

    2017-01-01

    The phase-1 upgrades of the CMS Level-1 calorimeter trigger have been completed. The Level-1 trigger has been fully commissioned and will be used by CMS to collect data starting with the 2016 data run. The new trigger has been designed to improve the performance at high luminosity and large numbers of simultaneous inelastic collisions per crossing (pile-up). For this purpose it uses a novel design, the Time Multiplexed Trigger (TMT) design, which enables the data from an event to be processed by a single trigger processor at full granularity over several bunch crossings. The TMT design is a modular design based on the uTCA standard. The architecture is flexible, and the number of trigger processors can be expanded according to the physics needs of CMS. Intelligent, more complex, and innovative algorithms are now the core of the first decision layer of CMS: the upgraded trigger system implements pattern recognition and MVA (Boosted Decision Tree) regression techniques in the trigger processors for pT assignment, pile-up subtraction, and isolation requirements for electrons and taus. The performance of the TMT design, the latency measurements, and the algorithm performance measured using data are also presented here.

  18. Development and design of experiments optimization of a high temperature proton exchange membrane fuel cell auxiliary power unit with onboard fuel processor

    NASA Astrophysics Data System (ADS)

    Karstedt, Jörg; Ogrzewalla, Jürgen; Severin, Christopher; Pischinger, Stefan

    In this work, the concept development, system layout, component simulation, and the overall DOE system optimization of a HT-PEM fuel cell APU with a net electric power output of 4.5 kW and an onboard methane fuel processor are presented. A highly integrated system layout has been developed that enables fast startup within 7.5 min, a closed system water balance, and high fuel processor efficiencies of up to 85% due to the recuperation of the anode offgas burner heat. The integration of the system battery into the load management enhances the transient electric performance and the maximum electric power output of the APU system. Simulation models of the carbon monoxide influence on HT-PEM cell voltage, the concentration and temperature profiles within the autothermal reformer (ATR), and the CO conversion rates within the water-gas shift stages (WGSs) have been developed. They enable the optimization of the CO concentration in the anode gas of the fuel cell in order to achieve maximum system efficiencies and an optimized dimensioning of the ATR and WGS reactors. Furthermore, a DOE optimization of the global system parameters cathode stoichiometry, anode stoichiometry, air/fuel ratio, and steam/carbon ratio of the fuel processing system has been performed in order to achieve maximum system efficiencies for all system operating points under given boundary conditions.

  19. Optical systolic array processor using residue arithmetic

    NASA Technical Reports Server (NTRS)

    Jackson, J.; Casasent, D.

    1983-01-01

    The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
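
    The idea of residue arithmetic is that each channel computes modulo its own small modulus, carry-free and independently of the others, and the ordinary result is recovered at the end by Chinese remaindering. A small Python sketch with illustrative moduli (not those of the optical processor):

      from math import prod

      MODULI = (5, 7, 9, 11)            # pairwise coprime; range M = 3465

      def to_residues(x):
          return tuple(x % m for m in MODULI)

      def from_residues(r):
          # Chinese remainder reconstruction back to ordinary notation.
          M = prod(MODULI)
          x = 0
          for ri, mi in zip(r, MODULI):
              Mi = M // mi
              x += ri * Mi * pow(Mi, -1, mi)    # Python 3.8+ modular inverse
          return x % M

      # Each residue channel works mod its own small modulus, carry-free and
      # independent of the others -- which is what lets an optical processor
      # evaluate all channels concurrently at low per-channel dynamic range.
      a, b = 123, 25
      r = tuple((ra * rb) % m
                for ra, rb, m in zip(to_residues(a), to_residues(b), MODULI))
      assert from_residues(r) == a * b          # 3075 < 3465, so no overflow
      print(from_residues(r))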

  20. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, Dario B.

    1996-01-01

    The present device provides for a dynamically configurable communication network comprising a multi-processor parallel processing system with a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

  1. High-performance computing for airborne applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Quinn, Heather M; Manuzzato, Andrea; Fairbanks, Tom

    2010-06-28

    Recently, there have been attempts to move common satellite tasks to unmanned aerial vehicles (UAVs). UAVs are significantly cheaper to buy than satellites and easier to deploy on an as-needed basis. The more benign radiation environment also allows for an aggressive adoption of state-of-the-art commercial computational devices, which increases the amount of data that can be collected. There are a number of commercial computing devices currently available that are well-suited to high-performance computing. These devices range from specialized computational devices, such as field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), to traditional computing platforms, such as microprocessors. Even though the radiation environment is relatively benign, these devices could be susceptible to single-event effects. In this paper, we will present radiation data for high-performance computing devices in an accelerated neutron environment. These devices include a multi-core digital signal processor, two field-programmable gate arrays, and a microprocessor. From these results, we found that all of these devices are suitable for many airplane environments without reliability problems.

  2. Recognition of Speech from the Television with Use of a Wireless Technology Designed for Cochlear Implants.

    PubMed

    Duke, Mila Morais; Wolfe, Jace; Schafer, Erin

    2016-05-01

    Cochlear implant (CI) recipients often experience difficulty understanding speech in noise and speech that originates from a distance. Many CI recipients also experience difficulty understanding speech originating from a television. Use of hearing assistance technology (HAT) may improve speech recognition in noise and for signals that originate from more than a few feet from the listener; however, there are no published studies evaluating the potential benefits of a wireless HAT designed to deliver audio signals from a television directly to a CI sound processor. The objective of this study was to compare speech recognition in quiet and in noise of CI recipients with the use of their CI alone and with the use of their CI and a wireless HAT (Cochlear Wireless TV Streamer). A two-way repeated measures design was used to evaluate performance differences obtained in quiet and in competing noise (65 dBA) with the CI sound processor alone and with the sound processor coupled to the Cochlear Wireless TV Streamer. Sixteen users of Cochlear Nucleus 24 Freedom, CI512, and CI422 implants were included in the study. Participants were evaluated in four conditions including use of the sound processor alone and use of the sound processor with the wireless streamer in quiet and in the presence of competing noise at 65 dBA. Speech recognition was evaluated in each condition with two full lists of Computer-Assisted Speech Perception Testing and Training Sentence-Level Test sentences presented from a light-emitting diode television. Speech recognition in noise was significantly better with use of the wireless streamer compared to participants' performance with their CI sound processor alone. There was also a nonsignificant trend toward better performance in quiet with use of the TV Streamer. Performance was significantly poorer when evaluated in noise compared to performance in quiet when the TV Streamer was not used. Use of the Cochlear Wireless TV Streamer designed to stream audio from a television directly to a CI sound processor provides better speech recognition in quiet and in noise when compared to performance obtained with use of the CI sound processor alone. American Academy of Audiology.

  3. Raexplore: A Performance Modeling Framework for Architecture Exploration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan

    We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.

  4. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System.

    PubMed

    Zhang, Zhen; Ma, Cheng; Zhu, Rong

    2017-08-23

    Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and have achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementations of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose an FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output (MIMO) real-time temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.

  5. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System

    PubMed Central

    Zhang, Zhen; Zhu, Rong

    2017-01-01

    Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and have achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementations of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose an FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output (MIMO) real-time temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas. PMID:28832522

  6. A Parallel Framework with Block Matrices of a Discrete Fourier Transform for Vector-Valued Discrete-Time Signals.

    PubMed

    Soto-Quiros, Pablo

    2015-01-01

    This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to the analysis, design, and implementation of parallel algorithms for multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is an advantage to using multicore processors and a parallel computing environment to reduce the high execution time. Additionally, speedup increases when the number of logical processors and the length of the signal increase.
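
    One plausible reading of the construction, offered here as an assumption rather than the paper's exact block-matrix framework: the vector-valued DFT transforms each of the m component channels independently, which is exactly the parallelism a multicore implementation can exploit. A Python sketch using threads in place of MATLAB workers:

      import numpy as np
      from concurrent.futures import ThreadPoolExecutor

      def vector_valued_dft(x):
          # DFT of a vector-valued signal x of shape (N, m): N time samples,
          # each an m-vector. In block-matrix form this is (F_N kron I_m)
          # applied to the stacked signal; the m component channels are
          # independent, which is the parallelism exploited here.
          n, m = x.shape
          with ThreadPoolExecutor() as pool:
              cols = list(pool.map(np.fft.fft, [x[:, j] for j in range(m)]))
          return np.stack(cols, axis=1)

      x = np.random.default_rng(1).normal(size=(8, 3))
      assert np.allclose(vector_valued_dft(x), np.fft.fft(x, axis=0))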

  7. Parallel implementation of D-Phylo algorithm for maximum likelihood clusters.

    PubMed

    Malik, Shamita; Sharma, Dolly; Khatri, Sunil Kumar

    2017-03-01

    This study explains a newly developed parallel algorithm for phylogenetic analysis of DNA sequences. The newly designed D-Phylo is a more advanced algorithm for phylogenetic analysis using the maximum likelihood approach. D-Phylo exploits the search capacity of k-means while avoiding its main limitation of becoming stuck at locally conserved motifs. The authors have tested the behaviour of D-Phylo on an Amazon Linux Amazon Machine Image (Hardware Virtual Machine) i2.4xlarge instance, with six central processing units, 122 GiB of memory, 8 × 800 solid-state drive Elastic Block Store volumes, and high network performance, using up to 15 processors for several real-life datasets. Distributing the clusters evenly across all the processors makes a near-linear speedup achievable when a large number of processors is used.

  8. Portable multi-node LQCD Monte Carlo simulations using OpenACC

    NASA Astrophysics Data System (ADS)

    Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

    This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes, using OpenACC to manage parallelism within the node and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performance, we then describe selected relevant details of the code, and finally we measure the level of performance and scaling performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly higher level of performance for this application, but also compares with results measured on other processors.

  9. 75 FR 52507 - Submission for OMB Review; Comment Request

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-08-26

    ... standards designed to ensure that all catch delivered to the processor is accurately weighed and accounted... NMFS for catcher/processors and motherships is based on the vessel meeting a series of design criteria. Because of the wide variations in factory layout for inshore processors, NMFS requires a performance-based...

  10. REGIONAL-SCALE (1000 KM) MODEL OF PHOTOCHEMICAL AIR POLLUTION. PART 2. INPUT PROCESSOR NETWORK DESIGN

    EPA Science Inventory

    Detailed specifications are given for a network of data processors and submodels that can generate the parameter fields required by the regional oxidant model formulated in Part 1 of this report. Operations performed by the processor network include simulation of the motion and d...

  11. Integrated Advanced Microwave Sounding Unit-A(AMSU-A). Engineering Test Report: METSAT A1 Signal Processor, (P/N 1331670-2, S /N F05)

    NASA Technical Reports Server (NTRS)

    Lund, D.

    1998-01-01

    This report presents a description of the tests performed, and the test data, for the A1 METSAT Signal Processor Assembly P/N 1331670-2, S/N F05. The assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive and Integration Procedure." The objective is to demonstrate functionality of the signal processor prior to instrument integration.

  12. Integrated Advanced Microwave Sounding Unit-A (AMSU-A). Engineering Test Report: METSAT A1 Signal Processor (P/N 1331670-2, S/N F03)

    NASA Technical Reports Server (NTRS)

    Lund, D.

    1998-01-01

    This report presents a description of tests performed, and the test data, for the A1 METSAT Signal Processor Assembly P/N 1331670-2, S/N F03. This assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive Test and Integration Procedure." The objective is to demonstrate functionality of the signal processor prior to instrument integration.

  13. SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processors with scanning CCD photodetectors

    NASA Astrophysics Data System (ADS)

    Esepkina, N. A.; Lavrov, A. P.; Anan'ev, M. N.; Blagodarnyi, V. S.; Ivanov, S. I.; Mansyrev, M. I.; Molodyakov, S. A.

    1995-10-01

    Two new types of optoelectronic radio-signal processors were investigated. Charge-coupled device (CCD) photodetectors are used in these processors under continuous scanning conditions, i.e. in a time delay and storage mode. One of these processors is based on a CCD photodetector array with a reference-signal amplitude transparency and the other is an adaptive acousto-optical signal processor with linear frequency modulation. The processor with the transparency performs multichannel discrete-analogue convolution of an input signal with a corresponding kernel of the transformation determined by the transparency. If the light source is an array of light-emitting diodes of special (stripe) geometry, the optical stages of the processor can be made from optical fibre components and the whole processor then becomes a rigid 'sandwich' (a compact hybrid optoelectronic microcircuit). A report is also given of a study of a prototype processor with optical fibre components for the reception of signals from a system with antenna aperture synthesis, which forms a radio image of the Earth.

  14. Method for simultaneous overlapped communications between neighboring processors in a multiple

    DOEpatents

    Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

    1991-01-01

    A parallel computing system and method having improved performance, where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes within the computing system.

  15. A new parallelization scheme for adaptive mesh refinement

    DOE PAGES

    Loffler, Frank; Cao, Zhoujian; Brandt, Steven R.; ...

    2016-05-06

    Here, we present a new method for parallelization of adaptive mesh refinement called Concurrent Structured Adaptive Mesh Refinement (CSAMR). This new method offers the lower computational cost (i.e. wall time × processor count) of subcycling in time, but with the runtime performance (i.e. smaller wall time) of evolving all levels at once using the time step of the finest level, which does more work than subcycling but has more parallelism. We demonstrate our algorithm's effectiveness using an adaptive mesh refinement code, AMSS-NCKU, and show performance on Blue Waters and other high performance clusters. For the class of problem considered in this paper, our algorithm achieves a speedup of 1.7-1.9 when the processor count for a given AMR run is doubled, consistent with our theoretical predictions.
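
    A toy cost model makes the trade-off concrete. Assuming equal cell counts per level and a refinement factor of 2 (both illustrative assumptions), subcycling does less total work per coarse step, while evolving every level at the finest step does roughly twice the work but lets all levels advance concurrently:

      def work_per_coarse_step(levels, refine=2, subcycle=True):
          # Toy cost model with equal cell counts per level: with subcycling,
          # level l takes refine**l substeps per coarse step; without it
          # (CSAMR-style), every level takes the finest level's count. Real
          # costs depend on grid sizes and parallel efficiency, ignored here.
          finest = refine ** (levels - 1)
          return sum(refine ** l if subcycle else finest
                     for l in range(levels))

      lv = 4
      print(work_per_coarse_step(lv), work_per_coarse_step(lv, subcycle=False))
      # -> 15 32: evolving all levels at the finest step roughly doubles the
      # work, but all levels advance concurrently, so with enough processors
      # the wall time tracks the finest level alone -- the trade CSAMR exploits.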

  16. A new parallelization scheme for adaptive mesh refinement

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Loffler, Frank; Cao, Zhoujian; Brandt, Steven R.

    Here, we present a new method for parallelization of adaptive mesh refinement called Concurrent Structured Adaptive Mesh Refinement (CSAMR). This new method offers the lower computational cost (i.e. wall time × processor count) of subcycling in time, but with the runtime performance (i.e. smaller wall time) of evolving all levels at once using the time step of the finest level, which does more work than subcycling but has more parallelism. We demonstrate our algorithm's effectiveness using an adaptive mesh refinement code, AMSS-NCKU, and show performance on Blue Waters and other high performance clusters. For the class of problem considered in this paper, our algorithm achieves a speedup of 1.7-1.9 when the processor count for a given AMR run is doubled, consistent with our theoretical predictions.

  17. Tensorial Basis Spline Collocation Method for Poisson's Equation

    NASA Astrophysics Data System (ADS)

    Plagne, Laurent; Berthou, Jean-Yves

    2000-01-01

    This paper aims to describe the tensorial basis spline collocation method applied to Poisson's equation. In the case of a localized 3D charge distribution in vacuum, this direct method based on a tensorial decomposition of the differential operator is shown to be competitive with both iterative BSCM and FFT-based methods. We emphasize the O(h^4) and O(h^6) convergence of TBSCM for cubic and quintic splines, respectively. We describe the implementation of this method on a distributed memory parallel machine. Performance measurements on a Cray T3E are reported. Our code exhibits high performance and good scalability: as an example, a 27 Gflops performance is obtained when solving Poisson's equation on a 256^3 non-uniform 3D Cartesian mesh using 128 T3E-750 processors. This represents 215 Mflops per processor.

  18. Performance prediction: A case study using a multi-ring KSR-1 machine

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He; Zhu, Jianping

    1995-01-01

    While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single-processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR-1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, hierarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
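
    The abstract does not reproduce the formula, but one standard formulation associated with the first author is isospeed scalability: the ratio of the ideally scaled work to the work actually needed to hold average per-processor speed constant. A sketch under that assumption:

      def isospeed_scalability(p1, w1, p2, w2):
          # Isospeed scalability: ratio of the work that would keep average
          # per-processor speed fixed to the work actually required. It is 1
          # under ideal scaling and drops as parallelism degrades. Whether
          # the paper's formula is exactly this one is not stated here.
          ideal_w2 = w1 * p2 / p1
          return ideal_w2 / w2

      print(isospeed_scalability(p1=64, w1=1e9, p2=128, w2=2e9))   # 1.0 ideal
      print(isospeed_scalability(p1=64, w1=1e9, p2=128, w2=3e9))   # ~0.67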

  19. Unaligned instruction relocation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.

    In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.

  20. Unaligned instruction relocation

    DOEpatents

    Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.; Sura, Zehra N.

    2018-01-23

    In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.

  1. Peregrine System | High-Performance Computing | NREL

    Science.gov Websites

    Peregrine compute nodes use Intel Xeon E5-2670 "Sandy Bridge" processors and 64 GB of memory per node. The /home, /nopt, and /projects file systems, including longer-term /projects storage, are mounted on all nodes.

  2. Design and Analysis of Scheduling Policies for Real-Time Computer Systems

    DTIC Science & Technology

    1992-01-01

    C. M. Krishna, "The Impact of Workload on the Reliability of Real-Time Processor Triads," to appear in Micro. Rel. [17] J. F. Kurose, "Performance... Processor Triads", to appear in Micro. Rel. J. F. Kurose, "Performance Analysis of Minimum Laxity Scheduling in Discrete Time Queueing Systems", to... exponentially distributed service times and deadlines. A similar model was developed for the ED policy for a single processor system under identical

  3. Rapid recovery from transient faults in the fault-tolerant processor with fault-tolerant shared memory

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Butler, Bryan P.

    1990-01-01

    The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.

  4. DeepX: Deep Learning Accelerator for Restricted Boltzmann Machine Artificial Neural Networks.

    PubMed

    Kim, Lok-Won

    2018-05-01

    Although there have been many decades of research and commercial presence on high performance general purpose processors, there are still many applications that require fully customized hardware architectures for further computational acceleration. Recently, deep learning has been successfully used to learn in a wide variety of applications, but its heavy computation demand has considerably limited its practical applications. This paper proposes a fully pipelined acceleration architecture to alleviate the high computational demand of an artificial neural network (ANN), namely restricted Boltzmann machine (RBM) ANNs. The implemented RBM ANN accelerator (integrating network size, using 128 input cases per batch, and running at a 303-MHz clock frequency) integrated in a state-of-the-art field-programmable gate array (FPGA) (Xilinx Virtex 7 XC7V-2000T) provides a computational performance of 301 billion connection-updates-per-second and about 193 times higher performance than a software solution running on general purpose processors. Most importantly, the architecture enables over 4 times (12 times in batch learning) higher performance compared with a previous work when both are implemented in an FPGA device (XC2VP70).

  5. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  6. Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.

    PubMed

    Hines, Michael; Kumar, Sameer; Schürmann, Felix

    2011-01-01

    For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10^10 connections (K is 1024, M is 1024^2, and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method, either implemented via non-blocking MPI_Isend or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend, due to the high overhead of initiating a spike communication. The two best performing methods (the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast, and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase-one destination cores, which then pass it on to their subset of phase-two destination cores) had similar performance, with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in the number of cells that fire on each processor in the interval between synchronizations. Spike exchange time itself is negligible, since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will ultimately be limited by imbalance in incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed, so that sets of cells which burst-fire together are on different processors, with their targets on as large a set of processors as possible.
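
    The baseline method is easy to show; the Multisend/DMA variants rely on Blue Gene/P-specific DCMF APIs and are not reproducible in a short sketch. A minimal mpi4py version of the MPI_Allgather spike exchange, with invented spike data:

      # Run with e.g.: mpiexec -n 4 python spikes.py
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()

      # Spikes this rank's cells fired in the last interval: (cell id, time).
      local_spikes = [(rank * 100 + i, 0.1 * i) for i in range(rank + 1)]

      # Every rank gathers every other rank's spikes -- the collective
      # baseline the Multisend variants were measured against.
      all_spikes = comm.allgather(local_spikes)    # list of per-rank lists
      flat = [s for per_rank in all_spikes for s in per_rank]
      if rank == 0:
          print(f"{len(flat)} spikes exchanged among {comm.Get_size()} ranks")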

  7. Electric prototype power processor for a 30cm ion thruster

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.; Schoenfeld, A. D.

    1977-01-01

    An electrical prototype power processor unit was designed, fabricated and tested with a 30 cm mercury ion engine for primary space propulsion. The power processor unit used the thyristor series resonant inverter as the basic power stage for the high power beam and discharge supplies. A transistorized series resonant inverter processed the remaining power for the low power outputs. The power processor included a digital interface unit to process all input commands and internal telemetry signals so that electric propulsion systems could be operated with a central computer system. The electrical prototype unit included design improvement in the power components such as thyristors, transistors, filters and resonant capacitors, and power transformers and inductors in order to reduce component weight, to minimize losses, and to control the component temperature rise. A design analysis for the electrical prototype is also presented on the component weight, losses, part count and reliability estimate. The electrical prototype was tested in a thermal vacuum environment. Integration tests were performed with a 30 cm ion engine and demonstrated operational compatibility. Electromagnetic interference data was also recorded on the design to provide information for spacecraft integration.

  8. Scheduler for multiprocessor system switch with selective pairing

    DOEpatents

    Gara, Alan; Gschwind, Michael Karl; Salapura, Valentina

    2015-01-06

    System, method and computer program product for scheduling threads in a multiprocessing system with selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). The method configures the selective pairing facility to use checking to provide one highly reliable thread for high-reliability work, and allocates threads that indicate a need for hardware checking to the corresponding paired processor cores. The method likewise configures the selective pairing facility to provide multiple independent cores, and allocates threads that indicate inherent resilience to the corresponding unpaired processor cores.

  9. A Programming Framework for Scientific Applications on CPU-GPU Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Owens, John

    2013-03-24

    At a high level, my research interests center around designing, programming, and evaluating computer systems that use new approaches to solve interesting problems. The rapid change of technology allows a variety of different architectural approaches to computationally difficult problems, and a constantly shifting set of constraints and trends makes the solutions to these problems both challenging and interesting. One of the most important recent trends in computing has been a move to commodity parallel architectures. This sea change is motivated by the industry's inability to continue to profitably increase performance on a single processor and instead to move to multiple parallel processors. In the period of review, my most significant work has been leading a research group looking at the use of the graphics processing unit (GPU) as a general-purpose processor. GPUs can potentially deliver better performance on a broad range of problems than their CPU counterparts, but effectively mapping complex applications to a parallel programming model with an emerging programming environment is a significant and important research problem.

  10. Electromagnetic versus electrical coupling of personal frequency modulation (FM) receivers to cochlear implant sound processors.

    PubMed

    Schafer, Erin C; Romine, Denise; Musgrave, Elizabeth; Momin, Sadaf; Huynh, Christy

    2013-01-01

    Previous research has suggested that electrically coupled frequency modulation (FM) systems substantially improved speech-recognition performance in noise in individuals with cochlear implants (CIs). However, there is limited evidence to support the use of electromagnetically coupled (neck loop) FM receivers with contemporary CI sound processors containing telecoils. The primary goal of this study was to compare speech-recognition performance in noise and subjective ratings of adolescents and adults using one of three contemporary CI sound processors coupled to electromagnetically and electrically coupled FM receivers from Oticon. A repeated-measures design was used to compare speech-recognition performance in noise and subjective ratings without and with the FM systems across three test sessions (Experiment 1) and to compare performance at different FM-gain settings (Experiment 2). Descriptive statistics were used in Experiment 3 to describe output differences measured through a CI sound processor. Experiment 1 included nine adolescents or adults with unilateral or bilateral Advanced Bionics Harmony (n = 3), Cochlear Nucleus 5 (n = 3), and MED-EL OPUS 2 (n = 3) CI sound processors. In Experiment 2, seven of the original nine participants were tested. In Experiment 3, electroacoustic output was measured from a Nucleus 5 sound processor when coupled to the electromagnetically coupled Oticon Arc neck loop and electrically coupled Oticon R2. In Experiment 1, participants completed a field trial with each FM receiver and three test sessions that included speech-recognition performance in noise and a subjective rating scale. In Experiment 2, participants were tested in three receiver-gain conditions. Results in both experiments were analyzed using repeated-measures analysis of variance. Experiment 3 involved electroacoustic-test measures to determine the monitor-earphone output of the CI alone and CI coupled to the two FM receivers. The results in Experiment 1 suggested that both FM receivers provided significantly better speech-recognition performance in noise than the CI alone; however, the electromagnetically coupled receiver provided significantly better speech-recognition performance in noise and better ratings in some situations than the electrically coupled receiver when set to the same gain. In Experiment 2, the primary analysis suggested significantly better speech-recognition performance in noise for the neck-loop versus electrically coupled receiver, but a second analysis, using the best performance across gain settings for each device, revealed no significant differences between the two FM receivers. Experiment 3 revealed monitor-earphone output differences in the Nucleus 5 sound processor for the two FM receivers when set to the +8 setting used in Experiment 1 but equal output when the electrically coupled device was set to a +16 gain setting and the electromagnetically coupled device was set to the +8 gain setting. Individuals with contemporary sound processors may show more favorable speech-recognition performance in noise with electromagnetically coupled FM systems (i.e., Oticon Arc), which is most likely related to the input processing and signal processing pathway within the CI sound processor for direct input versus telecoil input. Further research is warranted to replicate these findings with a larger sample size and to develop and validate a more objective approach to fitting FM systems to CI sound processors. American Academy of Audiology.

  11. VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures

    DOE PAGES

    Moreland, Kenneth; Sewell, Christopher; Usher, William; ...

    2016-05-09

    Here, one of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.

  12. VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures

    DOE PAGES

    Moreland, Kenneth; Sewell, Christopher; Usher, William; ...

    2016-05-09

    Execution on massively threaded processors is one of the most critical challenges for high-performance computing (HPC) scientific visualization. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Moreover, our current production scientific visualization software is not designed for these new types of architectures. In order to address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.

  13. Design of an integrated fuel processor for residential PEMFCs applications

    NASA Astrophysics Data System (ADS)

    Seo, Yu Taek; Seo, Dong Joo; Jeong, Jin Hyeok; Yoon, Wang Lai

    KIER has been developing a novel fuel processing system to provide hydrogen-rich gas to a residential PEMFC system. For the effective design of a compact hydrogen production system, each unit process, such as steam reforming and water-gas shift, has a steam generator and internal heat exchangers that are thermally and physically integrated into a single packaged hardware system. The newly designed fuel processor (prototype II) showed a thermal efficiency of 78% on an HHV basis with a methane conversion of 89%. The preferential oxidation unit, with two staged cascade reactors, reduces the CO concentration to below 10 ppm without complicated temperature control hardware, which is the prerequisite CO limit for the PEMFC stack. After the initial performance of the fuel processor was achieved, partial-load operation was carried out to test the performance and reliability of the fuel processor at various loads. The stability of the fuel processor was also demonstrated for three successive days with a stable composition of product gas and thermal efficiency. The CO concentration remained below 10 ppm during the test period, confirming the stable performance of the two-stage PrOx reactors.

  14. Efficient Use of Distributed Systems for Scientific Applications

    NASA Technical Reports Server (NTRS)

    Taylor, Valerie; Chen, Jian; Canfield, Thomas; Richard, Jacques

    2000-01-01

    Distributed computing has been regarded as the future of high performance computing. Nationwide high speed networks such as vBNS are becoming widely available to interconnect high-speed computers, virtual environments, scientific instruments, and large data sets. One of the major issues to be addressed with distributed systems is the development of computational tools that facilitate the efficient execution of parallel applications on such systems. These tools must exploit the heterogeneous resources (networks and compute nodes) in distributed systems. This paper presents a tool, called PART, which addresses this issue for mesh partitioning. PART takes advantage of the following heterogeneous system features: (1) processor speed; (2) number of processors; (3) local network performance; and (4) wide area network performance. Further, different finite element applications under consideration may have different computational complexities, different communication patterns, and different element types, which also must be taken into consideration when partitioning. PART uses parallel simulated annealing to partition the domain, taking into consideration network and processor heterogeneity. The results of using PART for an explicit finite element application executing on two IBM SPs (located at Argonne National Laboratory and the San Diego Supercomputer Center) indicate an increase in efficiency of up to 36% as compared to METIS, a widely used mesh partitioning tool. The input to METIS was modified to take into consideration heterogeneous processor performance; METIS does not take into consideration heterogeneous networks. The execution times for these applications were reduced by up to 30% as compared to METIS. These results are given in Figure 1 for four irregular meshes, with element counts ranging from 11,451 for the Barth4 mesh to 30,269 for the Barth5 mesh. Future work with PART entails using the tool with an integrated application requiring distributed systems. In particular, this application, illustrated in the document, entails an integration of finite element and fluid dynamics simulations to address the cooling of turbine blades in a gas turbine engine design. It is not uncommon to encounter high-temperature, film-cooled turbine airfoils with millions of degrees of freedom. This results from the complexity of the various components of the airfoils, which requires fine-grain meshing for accuracy. Additional information is contained in the original.
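
    A stripped-down sketch of the simulated-annealing idea: assign weighted mesh elements to processors of unequal speed so that the maximum load/speed time is minimized. PART additionally models communication patterns and heterogeneous networks, which this Python toy (all names and parameters invented) omits:

      import math
      import random

      def sa_partition(weights, speeds, steps=20_000, t0=1.0, seed=0):
          # Toy simulated-annealing partitioner: assign weighted elements to
          # processors of unequal speed to minimize the slowest processor's
          # load/speed time. Communication costs are not modeled.
          rng = random.Random(seed)
          assign = [rng.randrange(len(speeds)) for _ in weights]

          def cost(a):
              load = [0.0] * len(speeds)
              for w, p in zip(weights, a):
                  load[p] += w
              return max(l / s for l, s in zip(load, speeds))

          cur = cost(assign)
          for step in range(steps):
              temp = t0 * (1.0 - step / steps) + 1e-9     # linear cooling
              i = rng.randrange(len(weights))
              old, assign[i] = assign[i], rng.randrange(len(speeds))
              new = cost(assign)
              if new <= cur or rng.random() < math.exp((cur - new) / temp):
                  cur = new                                # accept move
              else:
                  assign[i] = old                          # undo move
          return assign, cur

      weights = [1.0] * 60
      speeds = [1.0, 1.0, 2.0]        # third processor twice as fast
      _, makespan = sa_partition(weights, speeds)
      print(round(makespan, 2))       # approaches 15.0 (15/15/30 split)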

  15. Integrated Advanced Microwave Sounding Unit-A (AMSU-A). Engineering Test Report: METSAT A1 Signal Processor (P/N: 1331670-2, S/N: F04)

    NASA Technical Reports Server (NTRS)

    Lund, D.

    1998-01-01

    This report presents a description of the tests performed, and the test data, for the A1 METSAT Signal Processor Assembly P/N 1331670-2, S/N F04. The assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive Test and Integration Procedure." The objective is to demonstrate functionality of the signal processor prior to instrument integration.

  16. Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schmalz, Mark S

    2011-07-24

    Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation G′ for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G → G′, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.

  17. 7 CFR 201.73 - Processors and processing of all classes of certified seed.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... (CONTINUED) FEDERAL SEED ACT FEDERAL SEED ACT REGULATIONS Certified Seed § 201.73 Processors and processing... of certified seed: (a) Facilities shall be available to perform processing without introducing... 7 Agriculture 3 2010-01-01 2010-01-01 false Processors and processing of all classes of certified...

  18. 7 CFR 201.73 - Processors and processing of all classes of certified seed.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... (CONTINUED) FEDERAL SEED ACT FEDERAL SEED ACT REGULATIONS Certified Seed § 201.73 Processors and processing... of certified seed: (a) Facilities shall be available to perform processing without introducing... 7 Agriculture 3 2011-01-01 2011-01-01 false Processors and processing of all classes of certified...

  19. 7 CFR 201.73 - Processors and processing of all classes of certified seed.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... (CONTINUED) FEDERAL SEED ACT FEDERAL SEED ACT REGULATIONS Certified Seed § 201.73 Processors and processing... of certified seed: (a) Facilities shall be available to perform processing without introducing... 7 Agriculture 3 2013-01-01 2013-01-01 false Processors and processing of all classes of certified...

  20. 7 CFR 201.73 - Processors and processing of all classes of certified seed.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... (CONTINUED) FEDERAL SEED ACT FEDERAL SEED ACT REGULATIONS Certified Seed § 201.73 Processors and processing... of certified seed: (a) Facilities shall be available to perform processing without introducing... 7 Agriculture 3 2012-01-01 2012-01-01 false Processors and processing of all classes of certified...

  1. 7 CFR 201.73 - Processors and processing of all classes of certified seed.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... (CONTINUED) FEDERAL SEED ACT FEDERAL SEED ACT REGULATIONS Certified Seed § 201.73 Processors and processing... of certified seed: (a) Facilities shall be available to perform processing without introducing... 7 Agriculture 3 2014-01-01 2014-01-01 false Processors and processing of all classes of certified...

  2. On board processor development for NASA's spaceborne imaging radar with system-on-chip technology

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi

    2004-01-01

    This paper reports preliminary results of a study of an on-board spaceborne SAR processor. The study consists of a processing-requirement analysis, functional specifications, and an implementation with system-on-chip technology. Finally, a minimum version of this on-board processor, designed for performance evaluation and partial demonstration, is illustrated.

  3. A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

    DOE PAGES

    Aktulga, Hasan Metin; Afibuzzaman, Md.; Williams, Samuel; ...

    2017-06-01

    As on-node parallelism increases and the performance gap between the processor and the memory system widens, achieving high performance in large-scale scientific applications requires an architecture-aware design of algorithms and solvers. We focus on the eigenvalue problem arising in nuclear Configuration Interaction (CI) calculations, where a few extreme eigenpairs of a sparse symmetric matrix are needed. Here, we consider a block iterative eigensolver whose main computational kernels are the multiplication of a sparse matrix with multiple vectors (SpMM), and tall-skinny matrix operations. We then present techniques to significantly improve the SpMM and the transpose operation SpMM^T by using the compressed sparse blocks (CSB) format. We achieve 3-4× speedup on the requisite operations over good implementations with the commonly used compressed sparse row (CSR) format. We develop a performance model that allows us to correctly estimate the performance of our SpMM kernel implementations, and we identify cache bandwidth as a potential performance bottleneck beyond DRAM. We also analyze and optimize the performance of LOBPCG kernels (inner product and linear combinations on multiple vectors) and show up to 15× speedup over using high performance BLAS libraries for these operations. The resulting high performance LOBPCG solver achieves 1.4× to 1.8× speedup over the existing Lanczos solver on a series of CI computations on high-end multicore architectures (Intel Xeons). We also analyze the performance of our techniques on an Intel Xeon Phi Knights Corner (KNC) processor.
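
    As a point of reference for the SpMM kernel discussed above, here is a minimal CSR-based sketch of a sparse matrix multiplied by a block of vectors; the paper's CSB format is not reproduced here, and the small matrix is invented for illustration.

        // Minimal SpMM sketch: sparse matrix (CSR) times a tall-skinny dense block.
        // The record's solver uses the CSB format; CSR is shown here for brevity.
        #include <iostream>
        #include <vector>

        int main() {
            // 3x3 sparse matrix in CSR form: [[4,0,1],[0,3,0],[2,0,5]]
            std::vector<int>    row_ptr = {0, 2, 3, 5};
            std::vector<int>    col_idx = {0, 2, 1, 0, 2};
            std::vector<double> val     = {4, 1, 3, 2, 5};

            const int n = 3, m = 2;                       // m vectors at once
            std::vector<double> X = {1, 0, 2, 1, 3, 0};   // n x m, row-major
            std::vector<double> Y(n * m, 0.0);

            // Y = A * X: one pass over the nonzeros updates all m vectors, which
            // is what makes SpMM more cache-friendly than m separate SpMVs.
            for (int i = 0; i < n; ++i)
                for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
                    for (int k = 0; k < m; ++k)
                        Y[i * m + k] += val[p] * X[col_idx[p] * m + k];

            for (double y : Y) std::cout << y << ' ';   // prints: 7 0 6 3 17 0
            std::cout << '\n';
        }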

  5. High flux heat exchanger

    NASA Astrophysics Data System (ADS)

    Flynn, Edward M.; Mackowski, Michael J.

    1993-01-01

    This interim report documents the results of the first two phases of a four-phase program to develop a high flux heat exchanger for cooling future high performance aircraft electronics. Phase 1 defines future needs for high flux heat removal in advanced military electronics systems. The results are sorted by broad application categories: (1) commercial digital systems, (2) military data processors, (3) power processors, and (4) radar and optical systems. For applications expected to be fielded in five to ten years, the outlook is for steady state flux levels of 30-50 W/sq cm for digital processors and several hundred W/sq cm for power control applications. In Phase 1, a trade study was conducted on emerging cooling technologies which could remove a steady state chip heat flux of 100 W/sq cm while holding chip junction temperature to 90 C. Constraints imposed on heat exchanger design, in order to reflect operation in a fighter aircraft environment, included a practical lower limit on coolant supply temperature, the preference for a nontoxic, nonflammable, and nonfreezing coolant, the need to minimize weight and volume, and operation in an accelerating environment. The trade study recommended the Compact High Intensity Cooler (CHIC) for design, fabrication, and test in the final two phases of this program.

  6. Case Study of Using High Performance Commercial Processors in Space

    NASA Technical Reports Server (NTRS)

    Ferguson, Roscoe C.; Olivas, Zulema

    2009-01-01

    The purpose of the Space Shuttle Cockpit Avionics Upgrade project (1999-2004) was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human-rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The CPU selected was from the PowerPC family, a reduced instruction set computer (RISC) line known for its high performance. However, the requirement for radiation tolerance resulted in the re-evaluation of the selected family member of the PowerPC line. Radiation testing revealed that the originally selected processor (PowerPC 7400) was too soft to meet mission objectives, and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s were radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation-tolerant version, but had some ability to detect failures. However, its cache tags did not provide parity, and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics, to prevent failures due to the softness of the upgraded avionics.

  7. Case Study of Using High Performance Commercial Processors in a Space Environment

    NASA Technical Reports Server (NTRS)

    Ferguson, Roscoe C.; Olivas, Zulema

    2009-01-01

    The purpose of the Space Shuttle Cockpit Avionics Upgrade project was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human-rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The CPU selected was from the PowerPC family, a reduced instruction set computer (RISC) line known for its high performance. However, the requirement for radiation tolerance resulted in the re-evaluation of the selected family member of the PowerPC line. Radiation testing revealed that the originally selected processor (PowerPC 7400) was too soft to meet mission objectives, and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s were radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation-tolerant version, but fared better than the 7400 in its ability to detect failures. However, its cache tags did not provide parity, and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics, to prevent failures due to the softness of the upgraded avionics.
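
    These records describe the dual-path strategy only at a high level; the sketch below illustrates the general pattern under stated assumptions: the same command is computed through two independently coded routines and issued only on agreement, so a radiation upset that corrupts one path surfaces as a miscompare. The function names and scaling are hypothetical, not the project's actual design.

        // Illustrative dual-path command generation (not the actual Shuttle design):
        // the same command is computed twice via independently written routines and
        // issued only if the two results agree.
        #include <cstdint>
        #include <iostream>
        #include <optional>

        // Path A: direct computation of a (hypothetical) actuator command.
        uint32_t command_path_a(double input) {
            return static_cast<uint32_t>(input * 100.0 + 0.5);
        }

        // Path B: deliberately different code shape computing the same value, so a
        // single upset is unlikely to corrupt both paths identically.
        uint32_t command_path_b(double input) {
            double scaled = input * 50.0;
            return static_cast<uint32_t>(scaled + scaled + 0.5);
        }

        std::optional<uint32_t> issue_command(double input) {
            uint32_t a = command_path_a(input);
            uint32_t b = command_path_b(input);
            if (a != b) return std::nullopt;  // miscompare: withhold command, flag fault
            return a;
        }

        int main() {
            if (auto cmd = issue_command(3.7))
                std::cout << "command issued: " << *cmd << '\n';
            else
                std::cout << "miscompare detected; command withheld\n";
        }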

  8. Spacecraft on-board SAR image generation for EOS-type missions

    NASA Technical Reports Server (NTRS)

    Liu, K. Y.; Arens, W. E.; Assal, H. M.; Vesecky, J. F.

    1987-01-01

    Spacecraft on-board synthetic aperture radar (SAR) image generation is an extremely difficult problem because of the requirements for high computational rates (usually on the order of giga-operations per second), high reliability (some missions last up to 10 years), and low power dissipation and mass (typically less than 500 watts and 100 kilograms). Recently, a JPL study was performed to assess the feasibility of on-board SAR image generation for EOS-type missions. This paper summarizes the results of that study. Specifically, it proposes a processor architecture using a VLSI time-domain parallel array for azimuth correlation. Using available space-qualifiable technology to implement the proposed architecture, an on-board SAR processor having acceptable power and mass characteristics appears feasible for EOS-type applications.

  9. Watchdog activity monitor (WAM) for use with high coverage processor self-test

    NASA Technical Reports Server (NTRS)

    Tulpule, Bhalchandra R. (Inventor); Crosset, III, Richard W. (Inventor); Versailles, Richard E. (Inventor)

    1988-01-01

    A high-fault-coverage, instruction-modeled self-test for a signal processor in a user environment is disclosed. The self-test executes a sequence of sub-tests and issues a state-transition signal upon the execution of each sub-test. The self-test may be combined with a watchdog activity monitor (WAM), which provides a test-failure signal when the counted number of state transitions does not agree with the expected number. An independent measure of time may be provided in the WAM to increase fault coverage by checking the processor's clock. Additionally, redundant processor systems are protected from inadvertent unsevering of a severed processor using a unique unsever arming technique and apparatus.
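
    A minimal software sketch of the monitoring scheme described above, with invented counts and names: each sub-test signals a state transition, and the monitor asserts a failure when the count disagrees with the expected number.

        // Illustrative watchdog activity monitor (WAM): the self-test emits a
        // transition signal after each sub-test; the monitor fails the test when
        // the transition count differs from the expected count.
        #include <iostream>

        class WatchdogActivityMonitor {
            int transitions_ = 0;
        public:
            void state_transition() { ++transitions_; }   // called by the self-test
            bool check(int expected) const { return transitions_ == expected; }
        };

        int main() {
            constexpr int kExpectedSubTests = 8;   // invented for the sketch
            WatchdogActivityMonitor wam;

            // The self-test executes its sub-tests; a fault that hangs or skips a
            // sub-test changes the number of transitions the WAM observes.
            for (int sub_test = 0; sub_test < kExpectedSubTests; ++sub_test) {
                // ... run sub_test, then signal the monitor ...
                wam.state_transition();
            }

            std::cout << (wam.check(kExpectedSubTests)
                              ? "self-test passed\n"
                              : "test-failure signal asserted\n");
        }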

  10. Parallel network simulations with NEURON.

    PubMed

    Migliore, M; Cannia, C; Lytton, W W; Markram, Henry; Hines, M L

    2006-10-01

    The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2,000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored.
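
    The minimum-delay synchronization described above can be sketched as follows, under assumptions: the subnet and spike types are placeholders, and a real implementation would exchange spikes via MPI rather than clearing a shared vector.

        // Sketch of minimum-delay synchronization for a parallel network simulation:
        // each "processor" advances its subnet by the minimum interprocessor synaptic
        // delay, after which all pending spikes are exchanged. Types are placeholders.
        #include <iostream>
        #include <vector>

        struct Spike { int target_subnet; double delivery_time; };

        struct Subnet {
            double now = 0.0;
            std::vector<Spike> outbox;
            void integrate_to(double t_stop) {
                // ... advance membrane equations, queue any spikes fired ...
                now = t_stop;
            }
        };

        int main() {
            const double min_delay = 1.0;   // ms; smallest interprocessor delay
            const double t_end = 10.0;
            std::vector<Subnet> subnets(4); // stand-ins for per-processor subnets

            for (double t = 0.0; t < t_end; t += min_delay) {
                // Each subnet can integrate the full interval independently: no spike
                // generated inside the interval can be delivered anywhere sooner than
                // min_delay after it was generated.
                for (auto& s : subnets) s.integrate_to(t + min_delay);

                // Barrier and spike exchange (an MPI all-to-all in the real code).
                for (auto& s : subnets) s.outbox.clear();
            }
            std::cout << "simulated " << t_end << " ms in min-delay intervals\n";
        }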

  11. PCI-based WILDFIRE reconfigurable computing engines

    NASA Astrophysics Data System (ADS)

    Fross, Bradley K.; Donaldson, Robert L.; Palmer, Douglas J.

    1996-10-01

    WILDFORCE is the first PCI-based custom reconfigurable computer that is based on the Splash 2 technology transferred from the National Security Agency and the Institute for Defense Analyses, Supercomputing Research Center (SRC). The WILDFORCE architecture has many of the features of the WILDFIRE computer, such as field-programmable gate array (FPGA) based processing elements, linear array and crossbar interconnection, and high-performance memory and I/O subsystems. New features introduced in the PCI-based WILDFIRE systems include memory/processor options that can be added to any processing element. These options include static and dynamic memory, digital signal processors (DSPs), FPGAs, and microprocessors. In addition to memory/processor options, many different application-specific connectors can be used to extend the I/O capabilities of the system, including systolic I/O, camera input, and video display output. This paper also discusses how this new PCI-based reconfigurable computing engine is used for rapid prototyping, real-time video processing, and other DSP applications.

  12. Parallel Network Simulations with NEURON

    PubMed Central

    Migliore, M.; Cannia, C.; Lytton, W.W; Markram, Henry; Hines, M. L.

    2009-01-01

    The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored. PMID:16732488

  13. Mobile Thread Task Manager

    NASA Technical Reports Server (NTRS)

    Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

    2013-01-01

    The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. To balance load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTM architecture wraps access to data with thread management to move threads to the home processor for that data, so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to the processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
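
    A minimal sketch of the global-queue idea described above, with a sum standing in for real work; standard C++ has no portable thread-to-processor homing, so the data-affinity part of the design is only noted in a comment.

        // Minimal sketch of workers drawing tasks from a global queue (the
        // load-balancing core of the record's map-reduce strategy). Real code would
        // additionally migrate a thread to the processor where the task's data is
        // homed; standard C++ has no portable way to express that.
        #include <iostream>
        #include <mutex>
        #include <queue>
        #include <thread>
        #include <vector>

        int main() {
            std::queue<int> global_queue;   // task IDs; stand-ins for real work
            for (int i = 0; i < 100; ++i) global_queue.push(i);

            std::mutex mtx;
            long long total = 0;            // reduce target

            auto worker = [&] {
                for (;;) {
                    int task;
                    {
                        std::lock_guard<std::mutex> lk(mtx);
                        if (global_queue.empty()) return;
                        task = global_queue.front();
                        global_queue.pop();
                    }
                    long long r = static_cast<long long>(task) * task; // "map" step
                    std::lock_guard<std::mutex> lk(mtx);
                    total += r;                                        // "reduce" step
                }
            };

            std::vector<std::thread> pool;
            for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
            for (auto& t : pool) t.join();
            std::cout << "reduced total: " << total << '\n';
        }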

  14. Optical laboratory solution and error model simulation of a linear time-varying finite element equation

    NASA Technical Reports Server (NTRS)

    Taylor, B. K.; Casasent, D. P.

    1989-01-01

    The use of simplified error models to accurately simulate and evaluate the performance of an optical linear-algebra processor is described. The optical architecture used to perform banded matrix-vector products is reviewed, along with a linear dynamic finite-element case study. The laboratory hardware and ac-modulation technique used are presented. The individual processor error-source models and their simulator implementation are detailed. Several significant simplifications are introduced to ease the computational requirements and complexity of the simulations. The error models are verified with a laboratory implementation of the processor, and are used to evaluate its potential performance.

  15. Media processors using a new microsystem architecture designed for the Internet era

    NASA Astrophysics Data System (ADS)

    Wyland, David C.

    1999-12-01

    The demands of digital image processing, communications and multimedia applications are growing more rapidly than traditional design methods can fulfill them. Previously, only custom hardware designs could provide the performance required to meet the demands of these applications. However, hardware design has reached a crisis point. Hardware design can no longer deliver a product with the required performance and cost in a reasonable time for a reasonable risk. Software based designs running on conventional processors can deliver working designs in a reasonable time and with low risk but cannot meet the performance requirements. What is needed is a media processing approach that combines very high performance, a simple programming model, complete programmability, short time to market and scalability. The Universal Micro System (UMS) is a solution to these problems. The UMS is a completely programmable (including I/O) system on a chip that combines hardware performance with the fast time to market, low cost and low risk of software designs.

  16. Analog Processor To Solve Optimization Problems

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

    1993-01-01

    Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

  17. High-resolution 3D simulations of NIF ignition targets performed on Sequoia with HYDRA

    NASA Astrophysics Data System (ADS)

    Marinak, M. M.; Clark, D. S.; Jones, O. S.; Kerbel, G. D.; Sepke, S.; Patel, M. V.; Koning, J. M.; Schroeder, C. R.

    2015-11-01

    Developments in the multiphysics ICF code HYDRA enable it to perform large-scale simulations on the Sequoia machine at LLNL. With an aggregate computing power of 20 petaflops, Sequoia offers an unprecedented capability to resolve the physical processes in NIF ignition targets for a more complete, consistent treatment of the sources of asymmetry. We describe modifications to HYDRA that enable it to scale to over one million processes on Sequoia. These include new options for replicating parts of the mesh over a subset of the processes, to avoid strong-scaling limits. We consider results from a 3D full ignition capsule-only simulation, performed using over one billion zones on 262,000 processors, which resolves surface perturbations through modes l = 200. We also report progress towards a high-resolution 3D integrated hohlraum simulation, performed using 262,000 processors, which resolves surface perturbations on the ignition capsule through modes l = 70. These aim for the most complete calculations yet of the interactions and overall impact of the various sources of asymmetry for NIF ignition targets. This work was performed under the auspices of the Lawrence Livermore National Security, LLC (LLNS) under Contract No. DE-AC52-07NA27344.

  18. Eigensolution of finite element problems in a completely connected parallel architecture

    NASA Technical Reports Server (NTRS)

    Akl, F.; Morel, M.

    1989-01-01

    A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.

  19. Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application

    NASA Technical Reports Server (NTRS)

    Tennille, Geoffrey M.; Overman, Andrea L.; Lambiotte, Jules J.; Streett, Craig L.

    1991-01-01

    The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executed efficiently on multiple processors. Although the conversion of every application is different, a discussion of the types of modification used to achieve gigaflop performance is included to assist others in the parallelization of applications for CRAY computers, especially those that were developed for other computers. An existing application, from the discipline of computational fluid dynamics, that had utilized over 2000 hrs of CPU time on CRAY-2 during the previous year was chosen as a test case to study the effectiveness of multitasking on a CRAY-2. The nature of dominant calculations within the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achieved. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. After optimal performance on a single CPU was achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code that had a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.

  20. Effective Vectorization with OpenMP 4.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

    This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor's potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.
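
    A minimal example of the kind of directive the paper describes: #pragma omp simd asserts that the loop is safe to vectorize and, with a reduction clause, removes the loop-carried dependence on the accumulator. The loop body is illustrative; compile with an OpenMP-aware compiler (e.g., g++ -O2 -fopenmp-simd).

        // Minimal OpenMP SIMD example: the directive asserts the loop is safe to
        // vectorize, letting the compiler emit SIMD instructions it might otherwise
        // forgo. Without OpenMP support the pragma is simply ignored.
        #include <iostream>
        #include <vector>

        int main() {
            const int n = 1 << 20;
            std::vector<float> a(n, 1.5f), b(n, 2.0f);
            float sum = 0.0f;

            // 'reduction(+:sum)' lets each vector lane keep a partial sum, which
            // would otherwise be a loop-carried dependence blocking vectorization.
            #pragma omp simd reduction(+:sum)
            for (int i = 0; i < n; ++i)
                sum += a[i] * b[i];

            std::cout << "dot product: " << sum << '\n';
        }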

  1. Software-defined reconfigurable microwave photonics processor.

    PubMed

    Pérez, Daniel; Gasulla, Ivana; Capmany, José

    2015-06-01

    We propose, for the first time to our knowledge, a software-defined reconfigurable microwave photonics signal processor architecture that can be integrated on a chip and is capable of performing all the main functionalities by suitable programming of its control signals. The basic configuration is presented, and a thorough end-to-end design model is derived that accounts for the performance of the overall processor, taking into consideration the impact and interdependencies of both its photonic and RF parts. We demonstrate the model's versatility by applying it to several relevant application examples.

  2. Approaches in highly parameterized inversion: TSPROC, a general time-series processor to assist in model calibration and result summarization

    USGS Publications Warehouse

    Westenbroek, Stephen M.; Doherty, John; Walker, John F.; Kelson, Victor A.; Hunt, Randall J.; Cera, Timothy B.

    2012-01-01

    The TSPROC (Time Series PROCessor) computer software uses a simple scripting language to process and analyze time series. It was developed primarily to assist in the calibration of environmental models. The software is designed to perform calculations on time-series data commonly associated with surface-water models, including calculation of flow volumes, transformation by means of basic arithmetic operations, and generation of seasonal and annual statistics and hydrologic indices. TSPROC can also be used to generate some of the key input files required to perform parameter optimization by means of the PEST (Parameter ESTimation) computer software. Through the use of TSPROC, the objective function for use in the model-calibration process can be focused on specific components of a hydrograph.
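
    As a flavor of the operations listed above (the record mentions flow volumes and seasonal statistics), here is a small sketch computing one seasonal mean from a daily series; the data and season window are invented, and TSPROC's own scripting language is not shown.

        // Sketch of a typical time-series calibration statistic: the mean of a daily
        // streamflow series over one season. Data and the season window are invented
        // for the example.
        #include <iostream>
        #include <vector>

        struct Observation { int day_of_year; double flow_cfs; };

        int main() {
            std::vector<Observation> series = {
                {60, 120.0}, {91, 340.0}, {120, 410.0}, {152, 280.0}, {200, 90.0}};

            // "Spring" season: days 60-151 (roughly March-May in a non-leap year).
            double sum = 0.0; int count = 0;
            for (const auto& obs : series)
                if (obs.day_of_year >= 60 && obs.day_of_year < 152) {
                    sum += obs.flow_cfs; ++count;
                }

            if (count > 0)
                std::cout << "spring mean flow: " << sum / count << " cfs\n";
            // A calibration workflow would emit this value as one observation
            // target for the parameter-estimation (PEST) run.
        }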

  3. Analysis of the packet formation process in packet-switched networks

    NASA Astrophysics Data System (ADS)

    Meditch, J. S.

    Two new queueing system models for the packet formation process in packet-switched telecommunication networks are developed, and their applications in process stability, performance analysis, and optimization studies are illustrated. The first, an M/M/1 queueing system characterization of the process, is a highly aggregated model which is useful for preliminary studies. The second, a marked extension of an earlier M/G/1 model, permits one to investigate stability, performance characteristics, and design of the packet formation process in terms of the details of processor architecture and hardware and software implementations, with processor structure and as many parameters as desired treated as variables. The two new models, together with the earlier M/G/1 characterization, span the spectrum of modeling complexity for the packet formation process from basic to advanced.

  4. A complexity-scalable software-based MPEG-2 video encoder.

    PubMed

    Chen, Guo-bin; Lu, Xin-ning; Wang, Xing-guo; Liu, Ji-lin

    2004-05-01

    With the development of general-purpose processors (GPP) and video signal processing algorithms, it is possible to implement a software-based real-time video encoder on a GPP, and its low cost and easy upgrade attract developers' interest in transferring video encoding from specialized hardware to more flexible software. In this paper, the encoding structure is first set up to support complexity scalability; then high-performance algorithms are applied to the key time-consuming modules in the coding process; finally, at the programming level, processor characteristics are considered to improve data-access efficiency and processing parallelism. Other programming methods, such as lookup tables, are adopted to reduce the computational complexity. Simulation results showed that these ideas could not only improve the global performance of video coding, but also provide great flexibility in complexity regulation.

  5. Benchmarking NWP Kernels on Multi- and Many-core Processors

    NASA Astrophysics Data System (ADS)

    Michalakes, J.; Vachharajani, M.

    2008-12-01

    Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.

  6. Peregrine System Configuration | High-Performance Computing | NREL

    Science.gov Websites

    Compute nodes and storage are connected by a high-speed InfiniBand network. Compute nodes are diskless; shared directories are mounted on all nodes, along with a file system dedicated to shared projects. Compute nodes have processors with 64 GB of memory.

  7. High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures

    NASA Technical Reports Server (NTRS)

    Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

    1985-01-01

    Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

  8. High Fidelity Simulations of Unsteady Flow through Turbopumps and Flowliners

    NASA Technical Reports Server (NTRS)

    Kiris, Cetin C.; Kwak, dochan; Chan, William; Housman, Jeff

    2006-01-01

    High fidelity computations were carried out to analyze the orbiter LH2 feedline flowliner. Computations were performed on the Columbia platform, a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each. Various computational models were used to characterize the unsteady flow features in the turbopump, including the orbiter Low-Pressure-Fuel-Turbopump (LPFTP) inducer, the orbiter manifold, and a test article used to represent the manifold. Unsteady flow originating from the orbiter LPFTP inducer is one of the major contributors to the high-frequency cyclic loading that results in high-cycle fatigue damage to the gimbal flowliners just upstream of the LPFTP. The flow fields for the orbiter manifold and representative test article are computed and analyzed for similarities and differences. The incompressible Navier-Stokes flow solver INS3D, based on the artificial compressibility method, was used to compute the flow of liquid hydrogen in each test article.

  9. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  10. Low-Latency Embedded Vision Processor (LLEVS)

    DTIC Science & Technology

    2016-03-01

    Recoverable fragments (table of contents and body text): 3.2.3 Task 3, Projected Performance Analysis of FPGA-Based Vision Processor; 3.2.3.1 Algorithms Latency Analysis; ...Programmable Gate Array Custom Hardware for Real-Time Multiresolution Analysis; ...conduct data analysis for performance projections. The data acquired through measurements, simulation, and estimation provide the requisite platform for ...

  11. Adaptive Optoelectronic Eyes: Hybrid Sensor/Processor Architectures

    DTIC Science & Technology

    2006-11-13

    Recoverable fragments (body text and table of contents): ...corresponding calculated data. The width of the mirror stopband is proportional to the refractive index difference between the high- and low-index materials. ...Silicon VLSI Neuron Unit Arrays; Development of a Single-Sided Flip-Chip Bonding Process; Development of High Refractive Index Diffractive Optical Elements (DOEs); Development of High-Performance Antireflection Coatings for High Refractive Index DOEs; Design and Fabrication of Low Threshold ...

  12. Mapping of MPEG-4 decoding on a flexible architecture platform

    NASA Astrophysics Data System (ADS)

    van der Tol, Erik B.; Jaspers, Egbert G.

    2001-12-01

    In the field of consumer electronics, the advent of new features such as Internet, games, video conferencing, and mobile communication has triggered the convergence of television and computer technologies. This requires a generic media-processing platform that enables simultaneous execution of very diverse tasks, such as high-throughput stream-oriented data processing and highly data-dependent irregular processing with complex control flows. As a representative application, this paper presents the mapping of a Main Visual profile MPEG-4 decoder for High-Definition (HD) video onto a flexible architecture platform. A stepwise approach is taken, going from the decoder application toward an implementation proposal. First, the application is decomposed into separate tasks with self-contained functionality, clear interfaces, and distinct characteristics. Next, a hardware-software partitioning is derived by analyzing the characteristics of each task, such as the amount of inherent parallelism, the throughput requirements, the complexity of control processing, and the reuse potential over different applications and different systems. Finally, a feasible implementation is proposed that includes, amongst others, a very-long-instruction-word (VLIW) media processor, one or more RISC processors, and some dedicated processors. The mapping study of the MPEG-4 decoder proves the flexibility and extensibility of the media-processing platform. This platform enables an effective HW/SW co-design yielding a high performance density.

  13. Fast generation of computer-generated hologram by graphics processing unit

    NASA Astrophysics Data System (ADS)

    Matsuda, Sho; Fujii, Tomohiko; Yamaguchi, Takeshi; Yoshikawa, Hiroshi

    2009-02-01

    A cylindrical hologram is well known to be viewable over 360 degrees. Such a hologram requires high pixel resolution; therefore, a Computer-Generated Cylindrical Hologram (CGCH) requires a huge amount of calculation. In our previous research, we used a look-up table method for fast calculation with an Intel Pentium 4 at 2.8 GHz. It took 480 hours to calculate a high-resolution CGCH (504,000 x 63,000 pixels, with an average of 27,000 object points). To improve the quality of the reconstructed CGCH image, the fringe pattern requires higher spatial frequency and resolution; therefore, to increase the calculation speed, we have to change the calculation method. In this paper, to reduce the calculation time of a CGCH (912,000 x 108,000 pixels), we employ a Graphics Processing Unit (GPU). It took 4,406 hours to calculate the high-resolution CGCH on a Xeon at 3.4 GHz. Since a GPU has many streaming processors and a parallel processing structure, it works as a high-performance parallel processor. In addition, a GPU delivers its maximum performance on two-dimensional and streaming data. Recently, GPUs can also be utilized for general-purpose computation (GPGPU); for example, NVIDIA's GeForce 7 series became a programmable processor with the Cg programming language, and the subsequent GeForce 8 series supports CUDA, a software development kit made by NVIDIA. The theoretical calculation capability of the GPU is quoted as 500 GFLOPS. From the experimental results, we achieved a calculation 47 times faster than our previous work, which used a CPU. The CGCH can therefore be generated in 95 hours, and the total time to calculate and print the CGCH is 110 hours.
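
    The underlying computation is a large interference sum over object points for every hologram pixel, which is why it maps so well to GPUs; below is a hedged CPU sketch of that kernel, with simplified geometry and invented constants (a GPU version would assign one thread per pixel).

        // Sketch of the kernel behind computer-generated hologram calculation: each
        // hologram pixel accumulates interference fringes from every object point,
        // an O(pixels x points) sum that parallelizes trivially on a GPU. Geometry
        // and constants here are simplified and illustrative.
        #include <cmath>
        #include <iostream>
        #include <vector>

        struct Point { double x, y, z, amplitude; };

        int main() {
            const int    width = 256, height = 256;   // tiny next to a real CGCH
            const double pi = 3.14159265358979323846;
            const double wavelength = 633e-9, pitch = 1e-6;
            const double k = 2.0 * pi / wavelength;    // wavenumber

            std::vector<Point> object = {{0, 0, 0.1, 1.0}, {1e-4, 5e-5, 0.12, 0.8}};
            std::vector<double> fringe(width * height, 0.0);

            for (int j = 0; j < height; ++j)
                for (int i = 0; i < width; ++i) {
                    double px = i * pitch, py = j * pitch;
                    double sum = 0.0;
                    for (const auto& p : object) {     // GPU: one thread per pixel
                        double dx = px - p.x, dy = py - p.y;
                        double r = std::sqrt(dx * dx + dy * dy + p.z * p.z);
                        sum += p.amplitude * std::cos(k * r);
                    }
                    fringe[j * width + i] = sum;
                }
            std::cout << "computed " << fringe.size() << " fringe samples\n";
        }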

  14. Mapping a battlefield simulation onto message-passing parallel architectures

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1987-01-01

    Perhaps the most critical problem in distributed simulation is that of mapping: without an effective mapping of workload to processors the speedup potential of parallel processing cannot be realized. Mapping a simulation onto a message-passing architecture is especially difficult when the computational workload dynamically changes as a function of time and space; this is exactly the situation faced by battlefield simulations. This paper studies an approach where the simulated battlefield domain is first partitioned into many regions of equal size; typically there are more regions than processors. The regions are then assigned to processors; a processor is responsible for performing all simulation activity associated with the regions. The assignment algorithm is quite simple and attempts to balance load by exploiting locality of workload intensity. The performance of this technique is studied on a simple battlefield simulation implemented on the Flex/32 multiprocessor. Measurements show that the proposed method achieves reasonable processor efficiencies. Furthermore, the method shows promise for use in dynamic remapping of the simulation.

  15. Design of a dataway processor for a parallel image signal processing system

    NASA Astrophysics Data System (ADS)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

    1995-04-01

    Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor,' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates 8 bits in parallel in full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology and comprises about 200 K gates.

  16. Research in the design of high-performance reconfigurable systems

    NASA Technical Reports Server (NTRS)

    Slotnick, D. L.; Mcewan, S. D.; Spry, A. J.

    1984-01-01

    An initial design for the Bit Processor (BP), referred to in prior reports as the Processing Element or PE, has been completed. Eight BPs, together with their supporting random-access memory, a 64 k x 9 ROM to perform addition, routing logic, and some additional logic, constitute the components of a single stage. An initial stage design is given. Stages may be combined to perform high-speed fixed- or floating-point arithmetic. Stages can be configured into a range of arithmetic modules that includes bit-serial one- or two-dimensional arrays; one- or two-dimensional arrays of fixed- or floating-point processors; and specialized uniprocessors, such as long-word arithmetic units. One to eight BPs represent a likely initial chip level. The stage would then correspond to a first-level pluggable module. As both this project and VLSI CAD/CAM progress, however, it is expected that the chip level would migrate upward to the stage and, perhaps, ultimately the box level. The BP RAM, consisting of two banks, holds only operands and indices. Programs are at the box (high-level function) and system level. At the system level, initial effort has been concentrated on specifying the tools needed to evaluate design alternatives.

  17. AltiVec performance increases for autonomous robotics for the MARSSCAPE architecture program

    NASA Astrophysics Data System (ADS)

    Gothard, Benny M.

    2002-02-01

    One of the main tall poles that must be overcome to develop a fully autonomous vehicle is the inability of the computer to understand its surrounding environment to the level required for the intended task. The military mission scenario requires a robot to interact in a complex, unstructured, dynamic environment (see 'A High Fidelity Multi-Sensor Scene Understanding System for Autonomous Navigation'). The Mobile Autonomous Robot Software Self Composing Adaptive Programming Environment (MarsScape) perception research addresses three aspects of the problem: sensor system design, processing architectures, and algorithm enhancements. A prototype perception system has been demonstrated on robotic High Mobility Multi-purpose Wheeled Vehicle and All Terrain Vehicle testbeds. This paper addresses the tall pole of processing requirements and the performance improvements based on the selected MarsScape Processing Architecture. The processor chosen is the Motorola AltiVec G4 PowerPC, a highly parallelized commercial Single Instruction Multiple Data processor. Both derived perception benchmarks and actual perception subsystem code are benchmarked and compared against previous Demo II/Semi-autonomous Surrogate Vehicle processing architectures, along with desktop personal computers (PCs). Performance gains are highlighted with progress to date, and lessons learned and future directions are described.

  18. Accuracy of the lattice-Boltzmann method using the Cell processor

    NASA Astrophysics Data System (ADS)

    Harvey, M. J.; de Fabritiis, G.; Giupponi, G.

    2008-11-01

    Accelerator processors like the new Cell processor are extending the traditional platforms for scientific computation, allowing orders of magnitude more floating-point operations per second (flops) compared to standard central processing units. However, they currently lack double-precision support and support for some IEEE 754 capabilities. In this work, we develop a lattice-Boltzmann (LB) code to run on the Cell processor and test the accuracy of this lattice method on this platform. We run tests for different flow topologies, boundary conditions, and Reynolds numbers in the range Re = 6-350. In one case, simulation results show reduced mass and momentum conservation compared to an equivalent double-precision LB implementation. All other cases demonstrate the utility of the Cell processor for fluid dynamics simulations. Benchmarks on two Cell-based platforms are performed, the Sony PlayStation 3 and the QS20/QS21 IBM blade, obtaining speed-up factors of 7 and 21, respectively, compared to the original PC version of the code, and a conservative sustained performance of 28 gigaflops per single Cell processor. Our results suggest that the choice of IEEE 754 rounding mode is possibly as important as double-precision support for this specific scientific application.
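
    The reduced conservation mentioned above is a floating-point accumulation effect; the toy below (not from the paper) makes it visible by applying a mass-neutral update many times in float and in double and printing the drift.

        // Toy illustration (not from the paper) of why single precision can lose
        // mass conservation in lattice-Boltzmann-style updates: applying many small
        // corrections that should cancel exactly drifts measurably more in float
        // than in double.
        #include <iostream>

        int main() {
            const int steps = 10'000'000;
            float  mass_f = 1.0f;
            double mass_d = 1.0;

            // A collision-like update that should conserve mass exactly:
            // add and then remove the same small quantity each step.
            for (int i = 0; i < steps; ++i) {
                float df = 1e-7f * static_cast<float>(i % 7 - 3);
                mass_f += df;  mass_f -= df;
                double dd = 1e-7 * static_cast<double>(i % 7 - 3);
                mass_d += dd;  mass_d -= dd;
            }
            std::cout << "float drift:  " << mass_f - 1.0f << '\n'
                      << "double drift: " << mass_d - 1.0  << '\n';
        }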

  19. Method for prefetching non-contiguous data structures

    DOEpatents

    Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2009-05-05

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous but repetitive.
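
    A software analogue of the pointer-per-line idea, for illustration only: each line carries a pointer to the line needed next, and traversal prefetches through it. __builtin_prefetch is a GCC/Clang intrinsic; the patent describes hardware doing this transparently.

        // Software analogue of the patent's scheme: each memory line stores a
        // pointer to the line that will be needed next, so traversal can prefetch
        // it instead of relying on a predictive algorithm. __builtin_prefetch is a
        // GCC/Clang intrinsic, not standard C++.
        #include <cstddef>
        #include <iostream>
        #include <vector>

        struct Line {
            double payload[7];   // the line's normal data
            Line*  next;         // embedded pointer to the line needed next
        };

        int main() {
            std::vector<Line> pool(1000);
            for (std::size_t i = 0; i < pool.size(); ++i) {
                pool[i].payload[0] = static_cast<double>(i);
                pool[i].next = &pool[(i + 37) % pool.size()];  // non-contiguous chain
            }

            // Traverse the chain, prefetching the *next* line while using this one.
            double sum = 0.0;
            Line* p = &pool[0];
            for (int n = 0; n < 1000; ++n) {
                __builtin_prefetch(p->next);   // hint: fetch the pointed-to line
                sum += p->payload[0];          // ...while we work on the current line
                p = p->next;
            }
            std::cout << "checksum: " << sum << '\n';   // 499500 for this chain
        }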

  20. Numerical simulation of air hypersonic flows with equilibrium chemical reactions

    NASA Astrophysics Data System (ADS)

    Emelyanov, Vladislav; Karpenko, Anton; Volkov, Konstantin

    2018-05-01

    The finite volume method is applied to solve the unsteady three-dimensional compressible Navier-Stokes equations on unstructured meshes. High-temperature gas effects altering the aerodynamics of vehicles are taken into account. The possibilities of using graphics processing units (GPUs) for the simulation of hypersonic flows are demonstrated. Solutions of some test cases on GPUs are reported, and a comparison between computational results of equilibrium chemically reacting and perfect air flowfields is performed. The speedup of the solution on GPUs with respect to the solution on central processing units (CPUs) is reported. The results obtained provide a promising perspective for designing a GPU-based software framework for practical applications.

  1. The architecture of a video image processor for the space station

    NASA Technical Reports Server (NTRS)

    Yalamanchili, S.; Lee, D.; Fritze, K.; Carpenter, T.; Hoyme, K.; Murray, N.

    1987-01-01

    The architecture of a video image processor for space station applications is described. The architecture was derived from a study of the requirements of algorithms that are necessary to produce the desired functionality of many of these applications. Architectural options were selected based on a simulation of the execution of these algorithms on various architectural organizations. A great deal of emphasis was placed on the ability of the system to evolve and grow over the lifetime of the space station. The result is a hierarchical parallel architecture that is characterized by high-level language programmability, modularity, and extensibility, and that can meet the required performance goals.

  2. An MPI-IO interface to HPSS

    NASA Technical Reports Server (NTRS)

    Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

    1996-01-01

    This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.

  3. Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

    NASA Astrophysics Data System (ADS)

    Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

    2018-02-01

    We consider a standing wave simulation that is interesting from a computational point of view: solving coupled 2D perturbed sine-Gordon equations. We make an OpenMP realization which exploits both the thread and SIMD levels of parallelism. We test the OpenMP program on two energy-equivalent Intel architectures: 2× Xeon E5-2695 v2 processors (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and a Xeon Phi 7250 processor (code-named "Knights Landing", KNL). The results show 2 times better performance on the KNL processor.
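
    The two levels of parallelism described map naturally onto OpenMP worksharing plus SIMD directives. The loop below is a schematic sketch, not the authors' code: the grid size, the leapfrog update, and the unperturbed right-hand side u_tt = u_xx + u_yy - sin(u) are assumptions used only to show the directive structure.

        #include <cmath>
        #include <vector>

        // Schematic leapfrog update for u_tt = u_xx + u_yy - sin(u) on an N x N
        // grid: OpenMP threads split the rows, SIMD lanes vectorize each row.
        void step(const std::vector<double>& u, const std::vector<double>& u_prev,
                  std::vector<double>& u_next, int N, double dt, double h) {
            const double r = (dt * dt) / (h * h);
            #pragma omp parallel for                  // thread-level parallelism
            for (int i = 1; i < N - 1; ++i) {
                #pragma omp simd                      // SIMD-level parallelism
                for (int j = 1; j < N - 1; ++j) {
                    const int k = i * N + j;
                    const double lap = u[k - N] + u[k + N] + u[k - 1] + u[k + 1]
                                       - 4.0 * u[k];
                    u_next[k] = 2.0 * u[k] - u_prev[k] + r * lap
                                - dt * dt * std::sin(u[k]);
                }
            }
        }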

  4. Reduced power processor requirements for the 30-cm diameter HG ion thruster

    NASA Technical Reports Server (NTRS)

    Rawlin, V. K.

    1979-01-01

    The characteristics of power processors strongly impact the overall performance and cost of electric propulsion systems. A program was initiated to evaluate simplifications of the thruster-power processor interface requirements. The power processor requirements are mission dependent with major differences arising for those missions which require a nearly constant thruster operating point (typical of geocentric and some inbound planetary missions) and those requiring operation over a large range of input power (such as outbound planetary missions). This paper describes the results of tests which have indicated that as many as seven of the twelve power supplies may be eliminated from the present Functional Model Power Processor used with 30-cm diameter Hg ion thrusters.

  5. Metal membrane-type 25-kW methanol fuel processor for fuel-cell hybrid vehicle

    NASA Astrophysics Data System (ADS)

    Han, Jaesung; Lee, Seok-Min; Chang, Hyuksang

    A 25-kW on-board methanol fuel processor has been developed. It consists of a methanol steam reformer, which converts methanol to a hydrogen-rich gas mixture, and two metal membrane modules, which clean up the gas mixture to high-purity hydrogen. It produces hydrogen at rates up to 25 N m3/h, and the purity of the product hydrogen is over 99.9995% with a CO content of less than 1 ppm. In this fuel processor, the operating conditions of the reformer and the metal membrane modules are nearly the same, so that operation is simple and the overall system construction is compact, eliminating the extensive temperature control of the intermediate gas streams. The recovery of hydrogen in the metal membrane units is maintained at 70-75% by control of the pressure in the system, and the remaining 25-30% of the hydrogen is recycled to a catalytic combustion zone to supply heat for the methanol steam-reforming reaction. The thermal efficiency of the fuel processor is about 75% and the inlet air pressure is as low as 4 psi. The fuel processor is currently being integrated with a 25-kW polymer electrolyte membrane fuel-cell (PEMFC) stack developed by the Hyundai Motor Company. The stack exhibits the same performance as with pure hydrogen, which proves that maximum power output as well as minimum stack degradation is possible with this fuel processor. This fuel-cell 'engine' is to be installed in a hybrid passenger vehicle for road testing.

  6. Supercomputing '91; Proceedings of the 4th Annual Conference on High Performance Computing, Albuquerque, NM, Nov. 18-22, 1991

    NASA Technical Reports Server (NTRS)

    1991-01-01

    Various papers on supercomputing are presented. The general topics addressed include: program analysis/data dependence, memory access, distributed memory code generation, numerical algorithms, supercomputer benchmarks, latency tolerance, parallel programming, applications, processor design, networks, performance tools, mapping and scheduling, characterization affecting performance, parallelism packaging, computing climate change, combinatorial algorithms, hardware and software performance issues, system issues. (No individual items are abstracted in this volume)

  7. Implementation and Assessment of Advanced Analog Vector-Matrix Processor

    NASA Technical Reports Server (NTRS)

    Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    This paper discusses the design and implementation of an analog optical vector-matrix coprocessor with a throughput of 128 Mops for a personal computer. Vector-matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector-matrix processor consists of a laser diode source, an acoustooptical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector-matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low cost, highly energy efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components, are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation. The sufficiency of the measured accuracy of the processor is compared with that required for a range of typical problems. Calculations resolving alloy concentrations from spectral plume data of rocket engines are implemented on the optical processor, demonstrating its sufficiency for this problem. We also show how this technology can be easily extended to a 100 x 100, 10 MHz (200 Gops) processor.

  8. Phantom-GRAPE: Numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture

    NASA Astrophysics Data System (ADS)

    Tanikawa, Ataru; Yoshikawa, Kohji; Nitadori, Keigo; Okamoto, Takashi

    2013-02-01

    We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only Newton's forces, but also central forces with an arbitrary shape f(r) that has a finite cutoff radius rcut (i.e. f(r)=0 at r>rcut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×10^9 interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With four processor cores, we obtain a performance of 8×10^9 interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×10^9 and 4×10^9 interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions, respectively. These performances depend only weakly on the number of particles, irrespective of the force shape. This is in marked contrast to force calculations accelerated by graphics processing units (GPUs), whose performance depends strongly on the number of particles. The weak dependence of the performance on the number of particles is suitable for collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have a significant advantage over those accelerated by GPUs, especially on massively parallel environments.
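
    Phantom-GRAPE's actual kernels are hand-tuned AVX code; the function below is only a simplified single-precision sketch of the underlying idea, accumulating the acceleration on one particle from eight source particles per loop iteration. The array layout, the softening length eps2, and the assumption that nj is a multiple of 8 are illustrative choices, not the library's interface.

        #include <immintrin.h>

        // Simplified AVX sketch of a pairwise gravitational kernel: accumulate
        // the acceleration on particle i from 8 j-particles per iteration.
        void accumulate(const float* xj, const float* yj, const float* zj,
                        const float* mj, int nj,                // nj: multiple of 8 (assumed)
                        float xi, float yi, float zi, float eps2,
                        float* ax, float* ay, float* az) {
            __m256 pxi = _mm256_set1_ps(xi), pyi = _mm256_set1_ps(yi);
            __m256 pzi = _mm256_set1_ps(zi), e2  = _mm256_set1_ps(eps2);
            __m256 axv = _mm256_setzero_ps(), ayv = _mm256_setzero_ps();
            __m256 azv = _mm256_setzero_ps();
            for (int j = 0; j < nj; j += 8) {
                __m256 dx = _mm256_sub_ps(_mm256_loadu_ps(xj + j), pxi);
                __m256 dy = _mm256_sub_ps(_mm256_loadu_ps(yj + j), pyi);
                __m256 dz = _mm256_sub_ps(_mm256_loadu_ps(zj + j), pzi);
                __m256 r2 = _mm256_add_ps(e2, _mm256_add_ps(_mm256_mul_ps(dx, dx),
                             _mm256_add_ps(_mm256_mul_ps(dy, dy), _mm256_mul_ps(dz, dz))));
                __m256 rinv  = _mm256_rsqrt_ps(r2);          // fast approximate 1/sqrt
                __m256 rinv3 = _mm256_mul_ps(rinv, _mm256_mul_ps(rinv, rinv));
                __m256 s = _mm256_mul_ps(_mm256_loadu_ps(mj + j), rinv3);  // m / r^3
                axv = _mm256_add_ps(axv, _mm256_mul_ps(s, dx));
                ayv = _mm256_add_ps(ayv, _mm256_mul_ps(s, dy));
                azv = _mm256_add_ps(azv, _mm256_mul_ps(s, dz));
            }
            float tmp[8];
            _mm256_storeu_ps(tmp, axv); for (int k = 0; k < 8; ++k) *ax += tmp[k];
            _mm256_storeu_ps(tmp, ayv); for (int k = 0; k < 8; ++k) *ay += tmp[k];
            _mm256_storeu_ps(tmp, azv); for (int k = 0; k < 8; ++k) *az += tmp[k];
        }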

  9. Broadband set-top box using MAP-CA processor

    NASA Astrophysics Data System (ADS)

    Bush, John E.; Lee, Woobin; Basoglu, Chris

    2001-12-01

    Advances in broadband access are expected to exert a profound impact on our everyday life. Broadband will be the key to the digital convergence of communication, computer and consumer equipment. A common thread that facilitates this convergence comprises digital media and the Internet. To address this market, Equator Technologies, Inc., is developing the Dolphin broadband set-top box reference platform using its MAP-CA Broadband Signal Processor chip. The Dolphin reference platform is a universal media platform for the display and presentation of digital content on end-user entertainment systems. The objective of the Dolphin reference platform is to provide a complete set-top box system based on the MAP-CA processor. It includes all the necessary hardware and software components for the emerging broadcast and broadband digital media market based on IP protocols. Such a reference design requires broadband Internet access and high-performance digital signal processing. By using the MAP-CA processor, the Dolphin reference platform is completely programmable, allowing various codecs to be implemented in software, such as MPEG-2, MPEG-4, H.263 and proprietary codecs. The software implementation also enables field upgrades to keep pace with evolving technology and industry demands.

  10. Towards the formal specification of the requirements and design of a processor interface unit: HOL listings

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1993-01-01

    This technical report contains the HOL listings of the specification of the design and major portions of the requirements for a commercially developed processor interface unit (or PIU). The PIU is an interface chip performing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. This report contains the actual HOL listings of the PIU specification as it currently exists. Section two of this report contains general-purpose HOL theories that support the PIU specification. These theories include definitions for the hardware components used in the PIU, our implementation of bit words, and our implementation of temporal logic. Section three contains the HOL listings for the PIU design specification. Aside from the PIU internal bus (I-Bus), this specification is complete. Section four contains the HOL listings for a major portion of the PIU requirements specification. Specifically, it contains most of the definition for the PIU behavior associated with memory accesses initiated by the local processor.

  11. The Impact of 3D Stacking and Technology Scaling on the Power and Area of Stereo Matching Processors.

    PubMed

    Ok, Seung-Ho; Lee, Yong-Hwan; Shim, Jae Hoon; Lim, Sung Kyu; Moon, Byungin

    2017-02-22

    Recently, stereo matching processors have been adopted in real-time embedded systems such as intelligent robots and autonomous vehicles, which require minimal hardware resources and low power consumption. Meanwhile, thanks to the through-silicon via (TSV), three-dimensional (3D) stacking technology has emerged as a practical solution to achieving the desired requirements of a high-performance circuit. In this paper, we present the benefits of 3D stacking and process technology scaling on stereo matching processors. We implemented 2-tier 3D-stacked stereo matching processors with GlobalFoundries 130-nm and Nangate 45-nm process design kits and compare them with their two-dimensional (2D) counterparts to identify comprehensive design benefits. In addition, we examine the findings from various analyses to identify the power benefits of 3D-stacked integrated circuit (IC) and device technology advancements. From experiments, we observe that the proposed 3D-stacked ICs, compared to their 2D IC counterparts, obtain 43% area, 13% power, and 14% wire length reductions. In addition, we present a logic partitioning method suitable for a pipeline-based hardware architecture that minimizes the use of TSVs.

  12. The Impact of 3D Stacking and Technology Scaling on the Power and Area of Stereo Matching Processors

    PubMed Central

    Ok, Seung-Ho; Lee, Yong-Hwan; Shim, Jae Hoon; Lim, Sung Kyu; Moon, Byungin

    2017-01-01

    Recently, stereo matching processors have been adopted in real-time embedded systems such as intelligent robots and autonomous vehicles, which require minimal hardware resources and low power consumption. Meanwhile, thanks to the through-silicon via (TSV), three-dimensional (3D) stacking technology has emerged as a practical solution to achieving the desired requirements of a high-performance circuit. In this paper, we present the benefits of 3D stacking and process technology scaling on stereo matching processors. We implemented 2-tier 3D-stacked stereo matching processors with GlobalFoundries 130-nm and Nangate 45-nm process design kits and compare them with their two-dimensional (2D) counterparts to identify comprehensive design benefits. In addition, we examine the findings from various analyses to identify the power benefits of 3D-stacked integrated circuit (IC) and device technology advancements. From experiments, we observe that the proposed 3D-stacked ICs, compared to their 2D IC counterparts, obtain 43% area, 13% power, and 14% wire length reductions. In addition, we present a logic partitioning method suitable for a pipeline-based hardware architecture that minimizes the use of TSVs. PMID:28241437

  13. Accelerating Demand Paging for Local and Remote Out-of-Core Visualization

    NASA Technical Reports Server (NTRS)

    Ellsworth, David

    2001-01-01

    This paper describes a new algorithm that improves the performance of application-controlled demand paging for the out-of-core visualization of data sets that are on either local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and by performing multiple page reads in parallel. The new algorithm can be applied to many different visualization algorithms since application-controlled demand paging is not specific to any visualization algorithm. The paper includes measurements that show that the new multi-threaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by up to 60%. Visualization runs using data from remote disk ran about as fast as ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.
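
    The algorithmic core of the paper, keeping several page reads in flight so that I/O overlaps with computation, can be sketched with standard threading primitives. The sketch below is not the paper's implementation: the Page type, the read_page and visualize placeholders, and the lookahead depth of four are all assumptions.

        #include <algorithm>
        #include <cstddef>
        #include <future>
        #include <vector>

        struct Page { std::vector<char> bytes; };

        Page read_page(std::size_t id);       // placeholder: local or remote disk read
        void visualize(const Page& page);     // placeholder: per-page computation

        void process(const std::vector<std::size_t>& needed) {
            if (needed.empty()) return;
            // Number of reads kept in flight at once (an assumed tuning knob).
            const std::size_t lookahead = std::min<std::size_t>(4, needed.size());

            std::vector<std::future<Page>> inflight;
            std::size_t next = 0;
            while (next < lookahead)          // prime the pipeline with async reads
                inflight.push_back(std::async(std::launch::async,
                                              read_page, needed[next++]));

            // Consume pages in order while further reads proceed in the
            // background, overlapping I/O with computation.
            for (std::size_t done = 0; done < needed.size(); ++done) {
                Page page = inflight[done % lookahead].get();
                if (next < needed.size())
                    inflight[done % lookahead] = std::async(std::launch::async,
                                                            read_page, needed[next++]);
                visualize(page);
            }
        }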

  14. Performance tuning of N-body codes on modern microprocessors: I. Direct integration with a hermite scheme on x86_64 architecture

    NASA Astrophysics Data System (ADS)

    Nitadori, Keigo; Makino, Junichiro; Hut, Piet

    2006-12-01

    The main performance bottleneck of gravitational N-body codes is the force calculation between two particles. We have succeeded in speeding up this pair-wise force calculation by factors between 2 and 10, depending on the code and the processor on which the code is run. These speed-ups were obtained by writing highly fine-tuned code for x86_64 microprocessors. Any existing N-body code, running on these chips, can easily incorporate our assembly code programs. In the current paper, we present an outline of our overall approach, which we illustrate with one specific example: the use of a Hermite scheme for a direct N2 type integration on a single 2.0 GHz Athlon 64 processor, for which we obtain an effective performance of 4.05 Gflops, for double-precision accuracy. In subsequent papers, we will discuss other variations, including the combinations of N log N codes, single-precision implementations, and performance on other microprocessors.

  15. Real-time 3D adaptive filtering for portable imaging systems

    NASA Astrophysics Data System (ADS)

    Bockenbach, Olivier; Ali, Murtaza; Wainwright, Ian; Nadeski, Mark

    2015-03-01

    Portable imaging devices have proven valuable for emergency medical services both in the field and in hospital environments, and are becoming more prevalent in clinical settings where the use of larger imaging machines is impractical. 3D adaptive filtering is one of the most advanced techniques aimed at noise reduction and feature enhancement, but it is computationally very demanding and hence often not able to run with sufficient performance on a portable platform. In recent years, advanced multicore DSPs have been introduced that attain high processing performance while maintaining low levels of power dissipation. These processors enable the implementation of complex algorithms like 3D adaptive filtering, improving the image quality of portable medical imaging devices. In this study, the performance of a 3D adaptive filtering algorithm on a digital signal processor (DSP) is investigated. The performance is assessed by filtering a volume of 512×256×128 voxels sampled at a rate of 10 MVoxels/sec.

  16. Periodic Application of Concurrent Error Detection in Processor Array Architectures. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Chen, Paul Peichuan

    1993-01-01

    Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

  17. A 4.8 kbps code-excited linear predictive coder

    NASA Technical Reports Server (NTRS)

    Tremain, Thomas E.; Campbell, Joseph P., Jr.; Welch, Vanoy C.

    1988-01-01

    A secure voice system, STU-3, capable of providing end-to-end secure voice communications was developed in 1984. The terminal for the new system will be built around the standard LPC-10 voice processor algorithm. While the performance of the present STU-3 processor is considered to be good, its response to nonspeech sounds such as whistles, coughs and impulse-like noises may not be completely acceptable. Speech in noisy environments also causes problems with the LPC-10 voice algorithm. In addition, there is always a demand for something better. It is hoped that LPC-10's 2.4 kbps voice performance will be complemented with a very high quality speech coder operating at a higher data rate. This new coder is one of a number of candidate algorithms being considered for an upgraded version of the STU-3 in late 1989. The problems of designing a code-excited linear predictive (CELP) coder that provides very high quality speech at a 4.8 kbps data rate and can be implemented on today's hardware are considered.

  18. A Queue Simulation Tool for a High Performance Scientific Computing Center

    NASA Technical Reports Server (NTRS)

    Spear, Carrie; McGalliard, James

    2007-01-01

    The NASA Center for Computational Sciences (NCCS) at the Goddard Space Flight Center provides high performance highly parallel processors, mass storage, and supporting infrastructure to a community of computational Earth and space scientists. Long running (days) and highly parallel (hundreds of CPUs) jobs are common in the workload. NCCS management structures batch queues and allocates resources to optimize system use and prioritize workloads. NCCS technical staff use a locally developed discrete event simulation tool to model the impacts of evolving workloads, potential system upgrades, alternative queue structures and resource allocation policies.

  19. Architectural Specialization for Inter-Iteration Loop Dependence Patterns

    DTIC Science & Technology

    2015-10-01

    Batten, Christopher. Computer Systems Laboratory, School of Electrical and Computer Engineering. [Only presentation residue survives for this record: a technology-trends chart with axes for transistors (thousands), frequency (MHz), and typical power (W), marking the MIPS R2K, DEC Alpha 21264, and Intel P4; and a design-space chart of performance versus energy efficiency (tasks per joule), contrasting simple, embedded, and high-performance processor designs under a power constraint.]

  20. Considerations for Multiprocessor Topologies

    NASA Technical Reports Server (NTRS)

    Byrd, Gregory T.; Delagi, Bruce A.

    1987-01-01

    Choosing a multiprocessor interconnection topology may depend on high-level considerations, such as the intended application domain and the expected number of processors. It certainly depends on low-level implementation details, such as packaging and communications protocols. The authors first use rough measures of cost and performance to characterize several topologies. They then examine how implementation details can affect the realizable performance of a topology.

  1. Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems

    DTIC Science & Technology

    2013-12-01

    Results were reported in "... Loss Spectra," Proceedings of SPIE 8255 (2012), and in a journal publication: M. T. Crowley, D. Murrell, N. Patel, M. Breivik, C.-Y. Lin, Y. Li, B.-O. Fimland and L. F. Lester, "Analytical Modeling of the Temperature Performance of ..." [Only these citation fragments are available for this record.]

  2. Satellite on-board real-time SAR processor prototype

    NASA Astrophysics Data System (ADS)

    Bergeron, Alain; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Marchese, Linda; Bourqui, Pascal; Desnoyers, Nicholas; Legros, Mathieu; Guillot, Ludovic; Mercier, Luc; Châteauneuf, François

    2017-11-01

    A Compact Real-Time Optronic SAR Processor has been successfully developed and tested up to a Technology Readiness Level of 4 (TRL4), i.e., breadboard validation in a laboratory environment. SAR, or Synthetic Aperture Radar, is an active system allowing day and night imaging independent of the cloud coverage of the planet. The SAR raw data is a set of complex data for range and azimuth, which cannot be compressed. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates, this is a clear disadvantage. SAR images are typically processed electronically by applying dedicated Fourier transformations. This, however, can also be performed optically in real time; indeed, the first SAR images were originally processed optically. The optical Fourier processor architecture provides inherent parallel computing capabilities, allowing real-time SAR data processing and thus the ability for compression and strongly reduced communication bandwidth requirements for the satellite. SAR signal return data are in general complex: both amplitude and phase must be combined optically in the SAR processor for each range and azimuth pixel. Amplitude and phase are generated by dedicated spatial light modulators and superimposed by an optical relay set-up. The spatial light modulators display the full complex raw data information over a two-dimensional format, one dimension for azimuth and one for range. Since the entire signal history is displayed at once, the processor operates in parallel, yielding real-time performance, i.e., without a resulting bottleneck. Processing of both azimuth and range information is performed in a single pass. This paper focuses on the onboard capabilities of the compact optical SAR processor prototype that allows in-orbit processing of SAR images. Examples of processed ENVISAT ASAR images are presented. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed.
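
    For reference, the operation the optical processor performs in a single parallel pass is, digitally, a matched filter per range line: forward FFT, multiply by the conjugate spectrum of the reference chirp, inverse FFT. The sketch below uses FFTW as a stand-in and assumes a precomputed chirp spectrum; it illustrates the underlying mathematics, not the prototype's processing chain.

        #include <fftw3.h>
        #include <complex>
        #include <vector>

        // Digital equivalent of one range-compression pass: FFT the raw echo
        // line, multiply by the conjugate spectrum of the reference chirp,
        // then inverse FFT. The optical processor performs the equivalent
        // correlation for the whole signal history at once.
        void range_compress(std::vector<std::complex<double>>& line,
                            const std::vector<std::complex<double>>& chirp_spectrum) {
            const int n = static_cast<int>(line.size());
            fftw_complex* data = reinterpret_cast<fftw_complex*>(line.data());

            fftw_plan fwd = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD,  FFTW_ESTIMATE);
            fftw_plan inv = fftw_plan_dft_1d(n, data, data, FFTW_BACKWARD, FFTW_ESTIMATE);

            fftw_execute(fwd);
            for (int k = 0; k < n; ++k)   // matched filter plus 1/n normalization
                line[k] *= std::conj(chirp_spectrum[k]) / static_cast<double>(n);
            fftw_execute(inv);

            fftw_destroy_plan(fwd);
            fftw_destroy_plan(inv);
        }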

  3. Comparison of speech perception performance between Sprint/Esprit 3G and Freedom processors in children implanted with nucleus cochlear implants.

    PubMed

    Santarelli, Rosamaria; Magnavita, Vincenzo; De Filippi, Roberta; Ventura, Laura; Genovese, Elisabetta; Arslan, Edoardo

    2009-04-01

    To compare speech perception performance in children fitted with a previous-generation Nucleus sound processor, Sprint or Esprit 3G, and the Freedom, the most recently released system from the Cochlear Corporation, which features a larger input dynamic range. Prospective intrasubject comparative study. University Medical Center. Seventeen prelingually deafened children who had received the Nucleus 24 cochlear implant and used the Sprint or Esprit 3G sound processor. Cochlear implantation with Cochlear device. Speech perception was evaluated at baseline (Sprint, n = 11; Esprit 3G, n = 6) and after 1 month's experience with the Freedom sound processor. Identification and recognition of disyllabic words and identification of vowels were performed via recorded voice in quiet (70 dB [A]), in the presence of background noise at various signal-to-noise ratios (+10, +5, 0, -5) and at a soft presentation level (60 dB [A]). Consonant identification and recognition of disyllabic words, trisyllabic words, and sentences were evaluated in live voice. Frequency discrimination was measured in a subset of subjects (n = 5) by using an adaptive, 3-interval, 3-alternative, forced-choice procedure. Identification of disyllabic words administered at a soft presentation level showed a significant increase when switching to the Freedom compared with the previously worn processor in children using the Sprint or Esprit 3G. Identification and recognition of disyllabic words in the presence of background noise, as well as consonant identification and sentence recognition, increased significantly for the Freedom compared with the previously worn device only in children fitted with the Sprint. Frequency discrimination was significantly better when switching to the Freedom compared with the previously worn processor. Serial comparisons revealed that speech perception performance evaluated in children aged 5 to 15 years was superior with the Freedom than with previous generations of Nucleus sound processors. These differences are deemed to ensue from an increased input dynamic range, a feature that offers potentially enhanced phonemic discrimination.

  4. An "artificial retina" processor for track reconstruction at the full LHC crossing rate

    NASA Astrophysics Data System (ADS)

    Abba, A.; Bedeschi, F.; Caponio, F.; Cenci, R.; Citterio, M.; Cusimano, A.; Fu, J.; Geraci, A.; Grizzuti, M.; Lusardi, N.; Marino, P.; Morello, M. J.; Neri, N.; Ninci, D.; Petruzzo, M.; Piucci, A.; Punzi, G.; Ristori, L.; Spinella, F.; Stracka, S.; Tonelli, D.; Walsh, J.

    2016-07-01

    We present the latest results of an R&D study for a specialized processor capable of reconstructing, in a silicon pixel detector, high-quality tracks from high-energy collision events at 40 MHz. The processor applies a highly parallel pattern-recognition algorithm inspired by the quick detection of edges in the mammalian visual cortex. After a detailed study of a real-detector application, demonstrating that online reconstruction of offline-quality tracks is feasible at 40 MHz with sub-microsecond latency, we are implementing a prototype using common high-bandwidth FPGA devices.

  5. An "artificial retina" processor for track reconstruction at the full LHC crossing rate

    DOE PAGES

    Abba, A.; Bedeschi, F.; Caponio, F.; ...

    2015-10-23

    Here, we present the latest results of an R&D study for a specialized processor capable of reconstructing, in a silicon pixel detector, high-quality tracks from high-energy collision events at 40 MHz. The processor applies a highly parallel pattern-recognition algorithm inspired by the quick detection of edges in the mammalian visual cortex. After a detailed study of a real-detector application, demonstrating that online reconstruction of offline-quality tracks is feasible at 40 MHz with sub-microsecond latency, we are implementing a prototype using common high-bandwidth FPGA devices.

  6. An Application-Based Performance Characterization of the Columbia Supercluster

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Hoaqiang; Kiris, Cetin; Saini, Subhash

    2005-01-01

    Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.

  7. Integrated Advanced Microwave Sounding Unit-A (AMSU-A): Engineering Test Report: METSAT A2 Signal Processor (P/N 1331120-2, S/N F03) S/N 107

    NASA Technical Reports Server (NTRS)

    1998-01-01

    This report presents a description of the tests performed, and the test data, for the A2 METSAT Signal Processor Assembly PN: 1331120-2, S/N F03. The assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive Test and Integration Procedure."

  8. Integrated Advanced Microwave Sounding Unit-A (AMSU-A): Engineering Test Report, METSAT A2 Signal Processor (P/N 1331120-2, S/N F04) S/N 108

    NASA Technical Reports Server (NTRS)

    1998-01-01

    This report presents a description of the tests performed, and the test data, for the A2 METSAT Signal Processor Assembly PN: 1331120-2, S/N F04. The assembly was tested in accordance with AE-26754, "METSAT Signal Processor Scan Drive Test and Integration Procedure."

  9. Processor core for real time background identification of HD video based on OpenCV Gaussian mixture model algorithm

    NASA Astrophysics Data System (ADS)

    Genovese, Mariangela; Napoli, Ettore

    2013-05-01

    The identification of moving objects is a fundamental step in computer vision processing chains. The development of low cost and lightweight smart cameras steadily increases the demand for efficient, high-performance circuits able to process high definition video in real time. The paper proposes two processor cores aimed at performing real-time background identification on High Definition (HD, 1920×1080 pixel) video streams. The implemented algorithm is the OpenCV version of the Gaussian Mixture Model (GMM), a high-performance probabilistic algorithm for segmentation of the background that is, however, computationally intensive and impossible to implement on a general-purpose CPU under the constraint of real-time processing. In this paper, the equations of the OpenCV GMM algorithm are optimized in such a way that a lightweight and low-power implementation of the algorithm is obtained. The reported performances are also the result of the use of state-of-the-art truncated binary multipliers and ROM compression techniques for the implementation of the non-linear functions. The first circuit targets commercial FPGA devices and provides speed and logic resource occupation that surpass previously proposed implementations. The second circuit is oriented to an ASIC (UMC 90 nm) standard cell implementation. Both implementations are able to process more than 60 frames per second in 1080p format, a frame rate compatible with HD television.
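
    The software baseline these circuits accelerate is available directly in OpenCV; a minimal host-side use of its MOG2 Gaussian-mixture background subtractor looks like the sketch below. The input file name is a placeholder; history and variance threshold follow OpenCV's documented defaults, and shadow detection is disabled here for a binary mask (these settings are assumptions, not the paper's configuration).

        #include <opencv2/opencv.hpp>

        int main() {
            cv::VideoCapture cap("input_1080p.mp4");          // placeholder input
            auto mog2 = cv::createBackgroundSubtractorMOG2(
                /*history=*/500, /*varThreshold=*/16.0, /*detectShadows=*/false);

            cv::Mat frame, foregroundMask;
            while (cap.read(frame)) {
                // Each pixel is matched against its per-pixel Gaussian mixture;
                // unmatched pixels are labeled foreground in the output mask.
                mog2->apply(frame, foregroundMask);
                cv::imshow("foreground", foregroundMask);
                if (cv::waitKey(1) == 27) break;              // Esc to quit
            }
            return 0;
        }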

  10. Comparison of Communication Architectures and Network Topologies for Distributed Propulsion Controls (Preprint)

    DTIC Science & Technology

    2013-05-01

    ... logic to perform control function computations and are connected to the Full Authority Digital Engine Control (FADEC) via a high-speed data communication bus. The short-term distributed engine control configurations will be core ... concentrator; and high-temperature electronics, with a high-speed communication bus between the data concentrator and the control law processor (master FADEC). [Only these text fragments are available for this record.]

  11. An Early Quantum Computing Proposal

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lee, Stephen Russell; Alexander, Francis Joseph; Barros, Kipton Marcos

    The D-Wave 2X is the third generation of quantum processing created by D-Wave. NASA (with Google and USRA) and Lockheed Martin (with USC) both own D-Wave systems. Los Alamos National Laboratory (LANL) purchased a D-Wave 2X in November 2015. The D-Wave 2X processor contains (nominally) 1152 quantum bits (or qubits) and is designed to specifically perform quantum annealing, which is a well-known method for finding a global minimum of an optimization problem. This methodology is based on direct execution of a quantum evolution in experimental quantum hardware. While this can be a powerful method for solving particular kinds of problems, it also means that the D-Wave 2X processor is not a general computing processor and cannot be programmed to perform a wide variety of tasks. It is a highly specialized processor, well beyond what NNSA currently thinks of as an “advanced architecture.” A D-Wave is best described as a quantum optimizer. That is, it uses quantum superposition to find the lowest energy state of a system by repeated doses of power and settling stages. The D-Wave produces multiple solutions to any suitably formulated problem, one of which is the lowest energy state solution (global minimum). Mapping problems onto the D-Wave requires defining an objective function to be minimized and then encoding that function in the Hamiltonian of the D-Wave system. The quantum annealing method is then used to find the lowest energy configuration of the Hamiltonian using the current D-Wave Two, two-level, quantum processor. This is not always an easy thing to do, and the D-Wave Two has significant limitations that restrict problem sizes that can be run and algorithmic choices that can be made. Furthermore, as more people are exploring this technology, it has become clear that it is very difficult to come up with general approaches to optimization that can both utilize the D-Wave and do better than highly developed algorithms on conventional computers for specific applications. These are all fundamental challenges that must be overcome for the D-Wave, or similar, quantum computing technology to be broadly applicable.

  12. Automatic film processors' quality control test in Greek military hospitals.

    PubMed

    Lymberis, C; Efstathopoulos, E P; Manetou, A; Poudridis, G

    1993-04-01

    The two major military radiology installations (Athens, Greece), using a total of 15 automatic film processors, were assessed using the 21-step-wedge method. The results of quality control in all these processors are presented. The parameters measured under actual working conditions were base and fog, contrast and speed. Base and fog as well as speed displayed large variations, with average values generally higher than acceptable, whilst contrast displayed greater stability. Developer temperature was measured daily during the test and was found to be outside the film manufacturers' recommended limits in nine of the 15 processors. In only one processor did film passing time vary on an everyday basis, and this was due to maloperation. The developer pH test was not part of the daily monitoring service, being performed every 5 days for each film processor; pH was found to be in the range 9-12, and 10 of the 15 processors presented pH values outside the limits specified by the film manufacturers.

  13. 50 CFR 679.50 - Groundfish Observer Program.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... completion of the electronic vessel and/or processor survey(s); (B) Complete NMFS electronic vessel and/or processor surveys before performing other jobs or duties which are not part of NMFS groundfish observer...

  14. 50 CFR 679.50 - Groundfish Observer Program.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... completion of the electronic vessel and/or processor survey(s); (B) Complete NMFS electronic vessel and/or processor surveys before performing other jobs or duties which are not part of NMFS groundfish observer...

  15. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.

    PubMed

    Scharfe, Michael; Pielot, Rainer; Schreiber, Falk

    2010-01-11

    Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular the registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high-performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch-reducing and loop-unrolling techniques, with special attention to 32-bit multiplies and the limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3, as a low-cost CBE-based platform, offers an efficient alternative to conventional hardware for solving computational problems in image processing and bioinformatics.

  16. Integrated High-Speed Torque Control System for a Robotic Joint

    NASA Technical Reports Server (NTRS)

    Davis, Donald R. (Inventor); Radford, Nicolaus A. (Inventor); Permenter, Frank Noble (Inventor); Valvo, Michael C. (Inventor); Askew, R. Scott (Inventor)

    2013-01-01

    A control system for achieving high-speed torque for a joint of a robot includes a printed circuit board assembly (PCBA) having a collocated joint processor and high-speed communication bus. The PCBA may also include a power inverter module (PIM) and local sensor conditioning electronics (SCE) for processing sensor data from one or more motor position sensors. Torque control of a motor of the joint is provided via the PCBA as a high-speed torque loop. Each joint processor may be embedded within or collocated with the robotic joint being controlled. Collocation of the joint processor, PIM, and high-speed bus may increase noise immunity of the control system, and the localized processing of sensor data from the joint motor at the joint level may minimize bus cabling to and from each control node. The joint processor may include a field programmable gate array (FPGA).

  17. Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs

    NASA Astrophysics Data System (ADS)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland

    2012-08-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: production of prototypes of a space-qualified version of the commercial PowerFFT chip, called FFTC, based on the PowerFFT IP; and the development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC, including the Controller FPGA and SpaceWire interfaces, to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand-alone it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The presentation will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  18. Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs

    NASA Astrophysics Data System (ADS)

    Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas

    2012-10-01

    Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boost the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In the frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space-qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand-alone it is capable of computing FFTs of up to 1K complex samples in length in only 10 μs. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4 Gbit/s. When connected to up to 4 EDAC-protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview of the project, including the results of the validation of the FFTC ASIC prototypes.

  19. Pausing and activating thread state upon pin assertion by external logic monitoring polling loop exit time condition

    DOEpatents

    Chen, Dong; Giampapa, Mark; Heidelberger, Philip; Ohmacht, Martin; Satterfield, David L; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan

    2013-05-21

    A system and method for enhancing performance of a computer are disclosed. The computer system includes a data storage device, and a program stored in the data storage device whose steps are executed by a processor. The processor processes instructions from the program. A wait state in the processor waits for receipt of specified data. A thread in the processor has a pause state in which the processor waits for specified data. A pin in the processor initiates a return to an active state from the pause state for the thread. A logic circuit external to the processor is configured to detect a specified condition. The pin initiates a return to the active state of the thread when the specified condition is detected using the logic circuit.

  20. Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

    1984-01-01

    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct processor-to-processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization, are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.

  1. DESDynI Quad First Stage Processor - A Four Channel Digitizer and Digital Beam Forming Processor

    NASA Technical Reports Server (NTRS)

    Chuang, Chung-Lun; Shaffer, Scott; Smythe, Robert; Niamsuwan, Noppasin; Li, Samuel; Liao, Eric; Lim, Chester; Morfopolous, Arin; Veilleux, Louise

    2013-01-01

    The proposed Deformation, Eco-Systems, and Dynamics of Ice Radar (DESDynI-R) L-band SAR instrument employs multiple digital channels to optimize resolution while keeping a large swath on a single pass. High-speed digitization with very fine synchronization and digital beam forming are necessary in order to facilitate this new technique. The Quad First Stage Processor (qFSP) was developed to achieve both the processing performance and the digitizing fidelity needed to accomplish this sweeping SAR technique. The qFSP utilizes high-precision, high-speed analog-to-digital converters (ADCs), each with a finely adjustable clock distribution network, to digitize the channels at the fidelity necessary to allow for digital beam forming. The Xilinx Virtex-5 FX130T part handles the processing to digitally calibrate each channel as well as to filter and beam form the receive signals. Demonstrating the digital processing required for digital beam forming and digital calibration is instrumental to the viability of the proposed DESDynI instrument. The qFSP development brings this implementation to Technology Readiness Level (TRL) 6. This paper details the design and development of the prototype qFSP as well as preliminary results from hardware tests.
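
    At its core, digital beam forming across the qFSP's four digitized channels amounts to applying a complex weight per channel and summing sample by sample. The narrowband sketch below illustrates only that principle; the function interface and the weights, which would encode per-channel calibration and steering phase, are assumptions rather than the qFSP's firmware design.

        #include <complex>
        #include <vector>

        using cfloat = std::complex<float>;

        // Schematic narrowband digital beam former: one output sample per input
        // sample, formed as a complex-weighted sum over the digitized channels.
        std::vector<cfloat> beamform(const std::vector<std::vector<cfloat>>& ch,
                                     const std::vector<cfloat>& weights) {
            const std::size_t channels = ch.size();        // 4 on the qFSP
            const std::size_t samples  = ch[0].size();
            std::vector<cfloat> out(samples, cfloat{0.0f, 0.0f});
            for (std::size_t c = 0; c < channels; ++c) {
                // Each weight encodes the channel's calibration and steering phase.
                for (std::size_t s = 0; s < samples; ++s)
                    out[s] += weights[c] * ch[c][s];
            }
            return out;
        }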

  2. Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sitaraman, Hariswaran; Grout, Ray W

    This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero- and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via a large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved here through the combination of careful memory layout, exposing multiple levels of fine-grain parallelism, and extensive use of vendor-supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~3 speed-up using the Intel 2013 compiler and ~1.5 using the Intel 2017 compiler for large chemical mechanisms, compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general-purpose computational fluid dynamics codes.
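
    "Careful memory layout" in this context usually means converting array-of-structures state into structure-of-arrays form, so that loops over grid points see unit-stride data and vectorize. The sketch below illustrates that general technique; the species count, type names, and scaling loop are illustrative assumptions, not the solver's actual data structures.

        #include <vector>

        // Array-of-structures (shown only for contrast): the mass fractions of
        // one grid point are contiguous, so a loop over points strides badly.
        struct CellAoS { double Y[53]; double T; };   // 53 species: an assumption

        // Structure-of-arrays: each species' values across all points are
        // contiguous, so per-point updates vectorize along the point index.
        struct StateSoA {
            std::vector<std::vector<double>> Y;  // Y[s][p]: species s at point p
            std::vector<double> T;               // temperature per point
        };

        void scale_species(StateSoA& st, int s, double factor) {
            double* y = st.Y[s].data();
            const std::size_t n = st.T.size();
            #pragma omp simd                      // unit-stride access enables SIMD
            for (std::size_t p = 0; p < n; ++p)
                y[p] *= factor;
        }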

  3. Fabrication Security and Trust of Domain-Specific ASIC Processors

    DTIC Science & Technology

    2016-10-30

    ... embedded in the design. For example, an ASIC processor potentially has a 10-1,000X performance advantage over its FPGA and GPP counterparts, but ... paper by summarizing our lessons learned from this project and suggests a few research directions. II. DOMAIN-SPECIFIC ASIC PROCESSORS. As Figure 1 has ... sponsored by the Assistant Secretary of Defense for Research & Engineering under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations ... [Only these text fragments are available for this record.]

  4. State recovery and lockstep execution restart in a system with multiprocessor pairing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gara, Alan; Gschwind, Michael K; Salapura, Valentina

    System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each pair of microprocessor or processor cores providing one highly reliable thread connects with system components such as a memory "nest" (or memory hierarchy), an optional system controller, an optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to the selective pairing facility via a switch or a bus. Each selectively paired processor core includes a transactional execution facility, wherein the system is configured to enable processor rollback to a previous state and to reinitialize lockstep execution in order to recover from an incorrect execution when one has been detected by the selective pairing facility.

  5. Low latency messages on distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Saltz, Joel

    1993-01-01

    Many of the issues in developing an efficient interface for communication on distributed memory machines are described, and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine-grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited to efficiently handling asynchronous communications than a single processor system. The ability to send either data or procedures provides great flexibility for minimizing message latency, based on the type of communication being performed. The tests performed and the proposed interface are described.

  6. Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems.

    PubMed

    Shehzad, Danish; Bozkuş, Zeki

    2016-01-01

    The increase in complexity of neuronal network models has escalated efforts to make the NEURON simulation environment efficient. Computational neuroscientists divide the equations into subnets amongst multiple processors to achieve better hardware performance. On parallel machines for neuronal networks, interprocessor spike exchange consumes a large share of the overall simulation time. In NEURON, the Message Passing Interface (MPI) is used for communication between processors, and the MPI_Allgather collective is exercised for spike exchange after each interval across distributed memory systems. Although increasing the number of processors achieves concurrency and better performance, it adversely affects MPI_Allgather, increasing the communication time between processors. This necessitates improving the communication methodology to decrease the spike exchange time over distributed memory systems. This work improves the MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication; the use of a recursive doubling mechanism facilitates efficient communication between the processors in precise steps. This approach enhanced communication concurrency and improved overall runtime, making NEURON more efficient for the simulation of large neuronal network models.
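
    The combination the paper describes, one-sided RMA transfers arranged in a recursive-doubling pattern, can be sketched with standard MPI calls. The sketch below assumes a power-of-two process count and fence synchronization; it is an illustration of the general technique, not the authors' implementation.

        #include <mpi.h>

        // Sketch: allgather built from one-sided puts (RMA) in a recursive-
        // doubling pattern. Assumes the rank count P is a power of two.
        // recvbuf must hold P*count doubles; sendblock holds count doubles.
        void rma_allgather(const double* sendblock, double* recvbuf,
                           int count, MPI_Comm comm) {
            int rank, P;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &P);

            for (int i = 0; i < count; ++i)          // place own block in position
                recvbuf[rank * count + i] = sendblock[i];

            MPI_Win win;
            MPI_Win_create(recvbuf, (MPI_Aint)P * count * sizeof(double),
                           sizeof(double), MPI_INFO_NULL, comm, &win);

            // After the step with stride s, each rank holds the contiguous run
            // of 2*s blocks shared by its group; log2(P) steps finish the job.
            for (int s = 1; s < P; s <<= 1) {
                int partner = rank ^ s;              // differs in exactly one bit
                int base    = rank & ~(s - 1);       // first block this rank holds
                MPI_Win_fence(0, win);
                MPI_Put(recvbuf + base * count, s * count, MPI_DOUBLE,
                        partner, (MPI_Aint)base * count, s * count, MPI_DOUBLE, win);
                MPI_Win_fence(0, win);
            }
            MPI_Win_free(&win);
        }

    Each rank puts the blocks it already holds into exactly one partner per step, so the exchange completes in log2(P) one-sided epochs instead of a single all-to-all collective.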

  7. Optimizing NEURON Simulation Environment Using Remote Memory Access with Recursive Doubling on Distributed Memory Systems

    PubMed Central

    Bozkuş, Zeki

    2016-01-01

    The increase in complexity of neuronal network models has escalated efforts to make the NEURON simulation environment efficient. Computational neuroscientists divide the equations into subnets amongst multiple processors to achieve better hardware performance. On parallel machines for neuronal networks, interprocessor spike exchange consumes a large share of the overall simulation time. In NEURON, the Message Passing Interface (MPI) is used for communication between processors, and the MPI_Allgather collective is exercised for spike exchange after each interval across distributed memory systems. Although increasing the number of processors achieves concurrency and better performance, it adversely affects MPI_Allgather, increasing the communication time between processors. This necessitates improving the communication methodology to decrease the spike exchange time over distributed memory systems. This work improves the MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication; the use of a recursive doubling mechanism facilitates efficient communication between the processors in precise steps. This approach enhanced communication concurrency and improved overall runtime, making NEURON more efficient for the simulation of large neuronal network models. PMID:27413363

  8. Millimeter-wave passive ultra-compact imaging technology for synthetic vision & mobile platforms

    NASA Technical Reports Server (NTRS)

    Olsen, Randall

    1996-01-01

    Substantial technical progress was made on all of the three high-risk subsystems of this program. The subsystems include the dielectric antenna, the G-band receiver, and the electro-optic image processor. Progress is approximately on schedule for both the receiver and the electro-optic processor development, while greater-than-anticipated challenges have been discovered in the dielectric antenna development. Much of the information in this report was covered in greater detail in the One-Year Review Meeting held at TTC on 22 February 1996. The performance goals of the dielectric antenna project are: Scan Angle -- 20 deg. desired; Loss -- 6 dB end to end (3 dB average); Frequency -- 206-218 GHz (6% bandwidth); Beam width -- 0.25 deg.; and Length -- 12 inches. The scan angle requirement was chosen to satisfy the needs of aircraft pilots. This requirement, coupled with the presently limited-bandwidth processors (1 GHz state-of-the-art and 12 GHz in development in this program), forces the antenna to be dielectric (high scan angle air-filled waveguide-based antennas would be too lossy and their performance would vary too much as a function of frequency). A high dielectric constant (e.g., 10) was initially chosen for the dielectric material. This choice led to the following fabrication challenges: the total thickness variation (TTV) tolerance is 1 micrometer; the coupler spacing tolerance is 1 micrometer; the width tolerance is larger, but unknown; and the surfaces must have a mirror finish. Also of importance is the difficulty in obtaining raw materials that satisfy the overall length requirement of 12 inches while simultaneously satisfying the above specifications.

  9. Computational efficiency of parallel combinatorial OR-tree searches

    NASA Technical Reports Server (NTRS)

    Li, Guo-Jie; Wah, Benjamin W.

    1990-01-01

    The performance of parallel combinatorial OR-tree searches is analytically evaluated. This performance depends on the complexity of the problem to be solved, the error allowance function, the dominance relation, and the search strategies. The exact performance may be difficult to predict due to the nondeterminism and anomalies of parallelism. The authors derive the performance bounds of parallel OR-tree searches with respect to the best-first, depth-first, and breadth-first strategies, and verify these bounds by simulation. They show that a near-linear speedup can be achieved with respect to a large number of processors for parallel OR-tree searches. Using the bounds developed, the authors derive sufficient conditions for assuring that parallelism will not degrade performance and necessary conditions for allowing parallelism to have a speedup greater than the ratio of the numbers of processors. These bounds and conditions provide the theoretical foundation for determining the number of processors required to assure a near-linear speedup.

  10. Development of a small-scale computer cluster

    NASA Astrophysics Data System (ADS)

    Wilhelm, Jay; Smith, Justin T.; Smith, James E.

    2008-04-01

    An increase in demand for computing power in academia has driven the need for high-performance machines. The computing power of a single processor has been steadily increasing, but it lags behind the demand for fast simulations. Since a single processor has hard limits to its performance, a cluster of computers can, with the proper software, multiply the performance of a single computer. Cluster computing has therefore become a much sought after technology. Typical desktop computers could be used for cluster computing, but they are not intended for constant full-speed operation and take up more space than rack mount servers. Specialty computers that are designed to be used in clusters meet high availability and space requirements, but can be costly. A market segment exists where custom-built desktop computers can be arranged in a rack mount configuration, gaining the space savings of traditional rack mount computers while remaining cost effective. To explore these possibilities, an experiment was performed to develop a computing cluster using desktop components for the purpose of decreasing the computation time of advanced simulations. This study indicates that a small-scale cluster built from off-the-shelf components can multiply the performance of a single desktop machine while minimizing occupied space and remaining cost effective.
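
    A minimal scaling test of the kind such a cluster would run, sketched with mpi4py (the problem, sizes, and file name are illustrative): a midpoint-rule estimate of pi is split evenly across the ranks and timed, so the speedup over a single desktop can be measured directly.

        # mpirun -np 8 python pi_bench.py
        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        n = 50_000_000                                 # total quadrature points
        t0 = MPI.Wtime()
        x = (np.arange(rank, n, size) + 0.5) / n       # this rank's strided share
        local = np.sum(4.0 / (1.0 + x * x)) / n        # partial integral of 4/(1+x^2)
        pi = comm.reduce(local, op=MPI.SUM, root=0)
        t1 = MPI.Wtime()

        if rank == 0:
            print(f"pi ~= {pi:.8f} on {size} processes in {t1 - t0:.3f} s")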

  11. Novel processor architecture for onboard infrared sensors

    NASA Astrophysics Data System (ADS)

    Hihara, Hiroki; Iwasaki, Akira; Tamagawa, Nobuo; Kuribayashi, Mitsunobu; Hashimoto, Masanori; Mitsuyama, Yukio; Ochi, Hiroyuki; Onodera, Hidetoshi; Kanbara, Hiroyuki; Wakabayashi, Kazutoshi; Tada, Munehiro

    2016-09-01

    Infrared sensor systems are of major interest for inter-planetary missions that investigate the nature and the formation processes of planets and asteroids. An infrared sensor system requires signal preprocessing functions that compensate the intensity output of the infrared image sensors, so as to obtain high quality data and a high compression ratio through the limited capacity of the transmission channels towards ground stations. For those implementations, combinations of Field Programmable Gate Arrays (FPGAs) and microprocessors are employed by AKATSUKI, the Venus Climate Orbiter, and HAYABUSA2, the asteroid probe. On the other hand, much smaller size and lower power consumption are demanded for future missions in order to accommodate more sensors. To fulfill this future demand, we developed a novel processor architecture which consists of reconfigurable cluster cores and programmable-logic cells with complementary atom switches. The complementary atom switches enable hardware programming without configuration memories, so soft errors on logic-circuit connections are completely eliminated. This is a noteworthy advantage for space applications which cannot be found in conventional re-writable FPGAs. Almost a tenfold reduction in power consumption is also expected compared to conventional re-writable FPGAs because of the elimination of configuration memories. The proposed processor architecture can be reconfigured by behavioral synthesis from a higher-level language specification. Consequently, the compensation functions are implemented in a single chip, without the program memories that conventional microprocessors require, while maintaining comparable performance. This enables us to embed a processor element on each infrared signal detector output channel.

  12. Real-time million-synapse simulation of rat barrel cortex.

    PubMed

    Sharp, Thomas; Petersen, Rasmus; Furber, Steve

    2014-01-01

    Simulations of neural circuits are bounded in scale and speed by available computing resources, and particularly by the differences in parallelism and communication patterns between the brain and high-performance computers. SpiNNaker is a computer architecture designed to address this problem by emulating the structure and function of neural tissue, using very many low-power processors and an interprocessor communication mechanism inspired by axonal arbors. Here we demonstrate that thousand-processor SpiNNaker prototypes can simulate models of the rodent barrel system comprising 50,000 neurons and 50 million synapses. We use the PyNN library to specify models, and the intrinsic features of Python to control experimental procedures and analysis. The models reproduce known thalamocortical response transformations, exhibit known, balanced dynamics of excitation and inhibition, and show a spatiotemporal spread of activity through the superficial cortical layers. These demonstrations are a significant step toward tractable simulations of entire cortical areas on the million-processor SpiNNaker machines in development.
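
    In PyNN, model specification of the kind used here looks roughly like the sketch below (a toy network assuming a PyNN 0.8-style API and an installed backend; on SpiNNaker the import would be pyNN.spiNNaker instead of the generic simulator shown):

        import pyNN.nest as sim   # swap for pyNN.spiNNaker on a SpiNNaker machine

        sim.setup(timestep=1.0)   # 1 ms steps, as in real-time SpiNNaker operation

        # a thalamic spike source driving a small cortical population
        thalamus = sim.Population(100, sim.SpikeSourcePoisson(rate=8.0))
        cortex = sim.Population(400, sim.IF_curr_exp())

        sim.Projection(thalamus, cortex,
                       sim.FixedProbabilityConnector(0.1),
                       synapse_type=sim.StaticSynapse(weight=0.5, delay=1.0))

        cortex.record("spikes")
        sim.run(1000.0)                        # simulate one second
        spikes = cortex.get_data("spikes")     # analyse with ordinary Python tools
        sim.end()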

  13. The Alaska SAR processor - Operations and control

    NASA Technical Reports Server (NTRS)

    Carande, Richard E.

    1989-01-01

    The Alaska SAR (synthetic-aperture radar) Facility (ASF) will be capable of receiving, processing, archiving, and producing a variety of SAR image products from three satellite-borne SARs: E-ERS-1 (ESA), J-ERS-1 (NASDA) and Radarsat (Canada). Crucial to the success of the ASF is the Alaska SAR processor (ASP), which will be capable of processing over 200 100-km x 100-km (Seasat-like) frames per day from the raw SAR data, at a ground resolution of about 30 m x 30 m. The processed imagery is of high geometric and radiometric accuracy, and is geolocated to within 500 m. Special-purpose hardware has been designed to execute a SAR processing algorithm to achieve this performance. This hardware is currently undergoing acceptance testing for delivery to the University of Alaska. Particular attention has been devoted to making the operations semi-automated and to providing a friendly operator interface via a computer workstation. The operations and control of the Alaska SAR processor are described.

  14. Towards the formal verification of the requirements and design of a processor interface unit: HOL listings

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1993-01-01

    This technical report contains the Higher-Order Logic (HOL) listings of the partial verification of the requirements and design for a commercially developed processor interface unit (PIU). The PIU is an interface chip performing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault tolerant computer system. This system, the Fault Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. This report contains the actual HOL listings of the PIU verification as it currently exists. Section two of this report contains general-purpose HOL theories and definitions that support the PIU verification. These include arithmetic theories dealing with inequalities and associativity, and a collection of tactics used in the PIU proofs. Section three contains the HOL listings for the completed PIU design verification. Section four contains the HOL listings for the partial requirements verification of the P-Port.

  15. Real-time million-synapse simulation of rat barrel cortex

    PubMed Central

    Sharp, Thomas; Petersen, Rasmus; Furber, Steve

    2014-01-01

    Simulations of neural circuits are bounded in scale and speed by available computing resources, and particularly by the differences in parallelism and communication patterns between the brain and high-performance computers. SpiNNaker is a computer architecture designed to address this problem by emulating the structure and function of neural tissue, using very many low-power processors and an interprocessor communication mechanism inspired by axonal arbors. Here we demonstrate that thousand-processor SpiNNaker prototypes can simulate models of the rodent barrel system comprising 50,000 neurons and 50 million synapses. We use the PyNN library to specify models, and the intrinsic features of Python to control experimental procedures and analysis. The models reproduce known thalamocortical response transformations, exhibit known, balanced dynamics of excitation and inhibition, and show a spatiotemporal spread of activity through the superficial cortical layers. These demonstrations are a significant step toward tractable simulations of entire cortical areas on the million-processor SpiNNaker machines in development. PMID:24910593

  16. Airborne optical tracking control system design study

    NASA Astrophysics Data System (ADS)

    1992-09-01

    The Kestrel LOS Tracking Program involves the development of a computer and algorithms for use in passive tracking of airborne targets from a high altitude balloon platform. The computer receives track-error signals from a video tracker connected to one of the imaging sensors. In addition, an on-board IRU (gyro), accelerometers, a magnetometer, and a two-axis inclinometer provide inputs which are used for initial acquisition and for coarse and fine tracking. Signals received by the control processor from the video tracker, IRU, accelerometers, magnetometer, and inclinometer are used to generate drive signals for the payload azimuth drive, the Gimballed Mirror System (GMS), and the Fast Steering Mirror (FSM). The hardware to be procured under the LOS tracking activity comprises the Controls Processor (CP), the IRU, and the FSM. The performance specifications for the GMS and the payload canister azimuth drive are established by the LOS tracking design team in an effort to achieve a tracking jitter of less than 3 micro-rad, 1 sigma, for one axis.

  17. Concurrent Probabilistic Simulation of High Temperature Composite Structural Response

    NASA Technical Reports Server (NTRS)

    Abdi, Frank

    1996-01-01

    A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software, 'GENOA', is dedicated to parallel and high-speed analysis for probabilistic evaluation of the high-temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives achieved in the development were: (1) utilization of the power of parallel processing and static/dynamic load-balancing optimization to make the complex simulation of structure, material, and processing of high-temperature composites affordable; (2) computational integration and synchronization of probabilistic mathematics, structural/material mechanics, and parallel computing; (3) implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism and increase convergence rates through high- and low-level processor assignment; (4) creation of the framework for a portable parallel architecture for machine-independent Multiple Instruction Multiple Data (MIMD), Single Instruction Multiple Data (SIMD), hybrid, and distributed workstation-type computers; and (5) market evaluation. The results of the Phase 2 effort provide a good basis for continuation and warrant a Phase 3 government and industry partnership.

  18. Job-mix modeling and system analysis of an aerospace multiprocessor.

    NASA Technical Reports Server (NTRS)

    Mallach, E. G.

    1972-01-01

    An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.

  19. Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Imam, Neena; Humble, Travis S.; McCaskey, Alex

    A content-addressable memory (CAM) stores key-value associations such that the key is recalled by providing its associated value. While CAM recall is traditionally performed using recurrent neural network models, we show how to solve this problem using adiabatic quantum optimization. Our approach maps the recurrent neural network to a commercially available quantum processing unit by taking advantage of the common underlying Ising spin model. We then assess the accuracy of the quantum processor to store key-value associations by quantifying recall performance against an ensemble of problem sets. We observe that different learning rules from the neural network community influence recall accuracy but performance appears to be limited by potential noise in the processor. The strong connection established between quantum processors and neural network problems supports the growing intersection of these two ideas.

  20. Present Status and Extensions of the Monte Carlo Performance Benchmark

    NASA Astrophysics Data System (ADS)

    Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.

    2014-06-01

    The NEA Monte Carlo Performance benchmark started in 2011, aiming to monitor over the years the ability to perform a full-size Monte Carlo reactor core calculation with detailed power production for each fuel pin with axial distribution. This paper gives an overview of the results contributed thus far. It shows that reaching a statistical accuracy of 1% for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with commodity compute nodes. However, on true supercomputers the speedup of parallel calculations continues to increase up to large numbers of processor cores. More experience is needed with calculations on true supercomputers using large numbers of processors in order to predict whether the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark are well suited for further investigations of full-core Monte Carlo calculations, and a need is felt for testing issues other than computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems: evaluating fission source convergence for a system with a high dominance ratio, coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities, and studying the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition are discussed.
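
    The quoted figures follow from the 1/sqrt(N) convergence of Monte Carlo statistics: halving the target error quadruples the required histories. The sketch below back-computes the histories needed for a given accuracy, with the per-history constant anchored to the 1% / 100-billion reference point reported above rather than measured.

        import math

        def histories_needed(target_rel_err, ref_err=0.01, ref_histories=1e11):
            # err = c / sqrt(N)  =>  N = (c / err)^2, with c fixed by the reference
            c = ref_err * math.sqrt(ref_histories)
            return (c / target_rel_err) ** 2

        print(f"{histories_needed(0.01):.2e}")    # 1.00e+11, the reference point
        print(f"{histories_needed(0.005):.2e}")   # 4.00e+11, 4x for half the error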

  1. Comparing the Performance of Two Dynamic Load Distribution Methods

    NASA Technical Reports Server (NTRS)

    Kale, L. V.

    1987-01-01

    Parallel processing of symbolic computations on a message-passing multiprocessor presents one challenge: to utilize the available processors effectively, the load must be distributed uniformly to all the processors. However, the structure of these computations cannot be predicted in advance, so static scheduling methods are not applicable. In this paper, we compare the performance of two dynamic, distributed load balancing methods with extensive simulation studies. The two schemes are the Contracting Within a Neighborhood (CWN) scheme proposed by us, and the Gradient Model proposed by Lin and Keller. We conclude that although simpler, the CWN is significantly more effective at distributing the work than the Gradient Model.
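
    The CWN idea, work contracted to a lightly loaded processor within a bounded neighborhood, can be caricatured in a few lines; this is a toy round-based simulation with invented threshold and neighborhood parameters, not the paper's simulator.

        P, THRESHOLD, NEIGHBORHOOD = 8, 4, 2
        loads = [10, 0, 0, 3, 0, 7, 1, 0]          # tasks queued per processor

        def neighbors(p):
            # ring topology: processors within NEIGHBORHOOD hops on either side
            return [(p + d) % P
                    for d in range(-NEIGHBORHOOD, NEIGHBORHOOD + 1) if d != 0]

        for _ in range(20):                        # a few contracting rounds
            moved = False
            for p in range(P):
                if loads[p] > THRESHOLD:
                    target = min(neighbors(p), key=lambda q: loads[q])
                    if loads[target] < loads[p] - 1:   # move only if it helps
                        loads[p] -= 1
                        loads[target] += 1
                        moved = True
            if not moved:
                break

        print(loads)    # work has diffused toward the lightly loaded processors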

  2. Software handlers for process interfaces

    NASA Technical Reports Server (NTRS)

    Bercaw, R. W.

    1976-01-01

    The principles involved in the development of software handlers for custom interfacing problems are discussed. Handlers for the CAMAC standard are examined in detail. The types of transactions that must be supported have been established by standards groups, eliminating conflicting requirements arising out of different design philosophies and applications. Implementation of the standard handlers has been facilitated by standardization of hardware. The necessary local processors can be placed in the handler when it is written or at run time by means of input/output directives, or they can be built into a high-performance input/output processor. The full benefits of these process interfaces will only be realized when software requirements are incorporated uniformly into the hardware.

  3. Data flow language and interpreter for a reconfigurable distributed data processor

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hurt, A.D.; Heath, J.R.

    1982-01-01

    An analytic language and an interpreter whereby an application's data flow graph may serve as an input to a reconfigurable distributed data processor is proposed. The architecture considered consists of a number of loosely coupled computing elements (CEs) which may be linked to data and file memories through fully nonblocking interconnect networks. The real-time performance of such an architecture depends upon its ability to alter its topology in response to changes in application, asynchronous data rates, and faults. Such a data flow language enhances the versatility of a reconfigurable architecture by allowing the user to specify the machine's topology at a very high level. 11 references.

  4. Recovery Act - CAREER: Sustainable Silicon -- Energy-Efficient VLSI Interconnect for Extreme-Scale Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chiang, Patrick

    2014-01-31

    The research goal of this CAREER proposal is to develop energy-efficient VLSI interconnect circuits and systems that will facilitate future massively-parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thousands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on-chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.

  5. Evaluation of the Xeon phi processor as a technology for the acceleration of real-time control in high-order adaptive optics systems

    NASA Astrophysics Data System (ADS)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine

    2014-08-01

    We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.
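
    The benchmarked kernel is an ordinary dense matrix-vector multiply; the sketch below times it the way an AO real-time control loop would, reporting mean and worst-case latency to expose execution-time variability. The dimensions are illustrative, not those of the paper's system.

        import time
        import numpy as np

        n_slopes, n_actuators = 9232, 5000       # illustrative high-order AO sizes
        R = np.random.rand(n_actuators, n_slopes).astype(np.float32)  # reconstructor
        s = np.random.rand(n_slopes).astype(np.float32)               # slope vector

        times = []
        for _ in range(200):
            t0 = time.perf_counter()
            a = R @ s                            # wavefront reconstruction: a = R s
            times.append(time.perf_counter() - t0)

        t = np.array(times[10:])                 # discard warm-up iterations
        print(f"mean {t.mean()*1e3:.2f} ms, worst {t.max()*1e3:.2f} ms, "
              f"loop rate {1.0/t.mean():.0f} Hz")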

  6. Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

    NASA Technical Reports Server (NTRS)

    Das, Sajal K.; Harvey, Daniel J.; Biswas, Rupak

    2001-01-01

    The Information Power Grid (IPG) concept developed by NASA is aimed at providing a metacomputing platform for large-scale distributed computations, hiding the intricacies of a highly heterogeneous environment while maintaining adequate security. In this paper, we propose a latency-tolerant partitioning scheme that dynamically balances processor workloads on the IPG and minimizes data movement and runtime communication. By simulating an unsteady adaptive mesh application on a wide area network, we study the performance of our load balancer under the Globus environment. The number of IPG nodes, the number of processors per node, and the interconnect speeds are parameterized to derive conditions under which the IPG would be suitable for parallel distributed processing of such applications. Experimental results demonstrate that effective solutions are achieved when the IPG nodes are connected by a high-speed asynchronous interconnection network.

  7. System-Level Radiation Hardening

    NASA Technical Reports Server (NTRS)

    Ladbury, Ray

    2014-01-01

    Although system-level radiation hardening can enable the use of high-performance components and enhance the capabilities of a spacecraft, hardening techniques can be costly and can compromise the very performance designers sought from the high-performance components. Moreover, such techniques often result in a complicated design, especially if several complex commercial microcircuits are used, each posing its own hardening challenges. The latter risk is particularly acute for Commercial-Off-The-Shelf components since high-performance parts (e.g. double-data-rate synchronous dynamic random access memories - DDR SDRAMs) may require other high-performance commercial parts (e.g. processors) to support their operation. For these reasons, it is essential that system-level radiation hardening be a coordinated effort, from setting requirements through testing up to and including validation.

  8. Simulink/PARS Integration Support

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vacaliuc, B.; Nakhaee, N.

    2013-12-18

    The state of the art for signal processor hardware has far out-paced the development tools for placing applications on that hardware. In addition, signal processors are available in a variety of architectures, each uniquely capable of handling specific types of signal processing efficiently. With these processors becoming smaller and demanding less power, it has become possible to group multiple processors, a heterogeneous set of processors, into single systems. Different portions of the desired problem set can be assigned to different processor types as appropriate. As software development tools do not keep pace with these processors, especially when multiple processors of different types are used, a method is needed to enable software code portability among multiple processors and multiple types of processors along with their respective software environments. Sundance DSP, Inc. has developed a software toolkit called "PARS", whose objective is to provide a framework that uses suites of tools provided by different vendors, along with modeling tools and a real time operating system, to build an application that spans different processor types. The software language used to express the behavior of the system is a very high level modeling language, "Simulink", a MathWorks product. ORNL has used this toolkit to effectively implement several deliverables. This CRADA describes this collaboration between ORNL and Sundance DSP, Inc.

  9. Improved Remapping Processor For Digital Imagery

    NASA Technical Reports Server (NTRS)

    Fisher, Timothy E.

    1991-01-01

    The proposed digital image processor is an improved version of the Programmable Remapper, which performs geometric and radiometric transformations on digital images. Features include overlapping and variably sized preimages. It overcomes some of the limitations of image-warping circuit boards that implement only those geometric transformations expressible in terms of polynomials of limited order. It also overcomes limitations of the existing Programmable Remapper and performs transformations at video rate.

  10. Cognitive Tools and Thinking Performance: The Case of Word Processors and Writing.

    ERIC Educational Resources Information Center

    Kellogg, Ronald T.; Mueller, Suzanne

    A study examined whether word processing amplifies writing performance and whether it restructures the process of writing. Sixteen college students wrote a short essay in a single session on either a word processor or in longhand. The quality of the essays was assessed by trained judges who rated their content and style. Text analysis methods were…

  11. Algorithms for parallel flow solvers on message passing architectures

    NASA Technical Reports Server (NTRS)

    Vanderwijngaart, Rob F.

    1995-01-01

    The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those immediately adjacent to them, then the first processor in the pipeline will receive a computational load that is less than that of subsequent processors, magnifying the pipeline slowdown effect. Extra compensation is needed for grid boundary effects, even if all grid blocks are equally sized.

  12. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  13. Effects of input processing and type of personal frequency modulation system on speech-recognition performance of adults with cochlear implants.

    PubMed

    Wolfe, Jace; Schafer, Erin; Parkinson, Aaron; John, Andrew; Hudson, Mary; Wheeler, Julie; Mucci, Angie

    2013-01-01

    The objective of this study was to compare speech recognition in quiet and in noise for cochlear implant recipients using two different types of personal frequency modulation (FM) systems (directly coupled [direct auditory input] versus induction neckloop) with each of two sound processors (Cochlear Nucleus Freedom versus Cochlear Nucleus 5). Two different experiments were conducted within this study. In both these experiments, mixing of the FM signal within the Freedom processor was implemented via the same scheme used clinically for the Freedom sound processor. In Experiment 1, the aforementioned comparisons were conducted with the Nucleus 5 programmed so that the microphone and FM signals were mixed and then the mixed signals were subjected to autosensitivity control (ASC). In Experiment 2, comparisons between the two FM systems and processors were conducted again with the Nucleus 5 programmed to provide a more complex multistage implementation of ASC during the preprocessing stage. This study was a within-subject, repeated-measures design. Subjects were recruited from the patient population at the Hearts for Hearing Foundation in Oklahoma City, OK. Fifteen subjects participated in Experiment 1, and 16 subjects participated in Experiment 2. Subjects were adults who had used either unilateral or bilateral cochlear implants for at least 1 year. In this experiment, no differences were found in speech recognition in quiet obtained with the two different FM systems or the various sound-processor conditions. With each sound processor, speech recognition in noise was better with the directly coupled direct auditory input system relative to the neckloop system. The multistage ASC processing of the Nucleus 5 sound processor provided better performance than the single-stage approach for the Nucleus 5 and the Nucleus Freedom sound processor. Speech recognition in noise is substantially affected by the type of sound processor, FM system, and implementation of ASC used by a Cochlear implant recipient.

  14. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, D.B.

    1994-07-19

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.
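
    The per-phase couplings described in the patent can be summarized in a small table-driven model; the route labels and function names below are ours, purely illustrative.

        # phase -> list of (source, destination) couplings inside one switch
        PHASES = {
            "data_acquisition": [("sensor", "local_processor"),
                                 ("sensor", "downlink")],
            "data_processing":  [],               # switch idle; the node computes
            "data_exchange":    [("uplink", "local_processor"),
                                 ("local_processor", "downlink")],
        }

        def configure(switch_id, phase):
            # in the patent, the master processor node distributes the phase and
            # each processor node sets its switch's internal multi-state switches
            routes = PHASES[phase]
            print(f"switch {switch_id}: {phase} -> {routes or 'no routes'}")

        for phase in PHASES:                      # master node steps the phases
            configure(0, phase)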

  15. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, Dario B.

    1994-01-01

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

  16. Implementation of a High-Speed FPGA and DSP Based FFT Processor for Improving Strain Demodulation Performance in a Fiber-Optic-Based Sensing System

    NASA Technical Reports Server (NTRS)

    Farley, Douglas L.

    2005-01-01

    NASA's Aviation Safety and Security Program is pursuing research in on-board Structural Health Management (SHM) technologies for purposes of reducing or eliminating aircraft accidents due to system and component failures. Under this program, NASA Langley Research Center (LaRC) is developing a strain-based structural health-monitoring concept that incorporates a fiber optic-based measuring system for acquiring strain values. This fiber optic-based measuring system provides for the distribution of thousands of strain sensors embedded in a network of fiber optic cables. The resolution of strain value at each discrete sensor point requires a computationally demanding data reduction software process that, when hosted on a conventional processor, is not suitable for near real-time measurement. This report describes the development and integration of an alternative computing environment using dedicated computing hardware for performing the data reduction. Performance comparison between the existing and the hardware-based system is presented.

  17. Impacts of the IBM Cell Processor to Support Climate Models

    NASA Technical Reports Server (NTRS)

    Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

    2008-01-01

    NASA is interested in the performance and cost benefits of adapting its applications to the IBM Cell processor. However, the Cell's 256 KB of local memory per SPE and its new communication mechanism make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% of computational time), (2) has a high computational load relative to the cost of transferring data from and to main memory, and (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision Fortran) to C and ported it, manually SIMDizing 4 independent columns, and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, 8.2X, and 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.
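
    The porting strategy, independent column physics processed four columns at a time, can be sketched as follows; the attenuation "kernel" is a stand-in for the GEOS-5 solar radiation code, and the sizes are illustrative.

        import numpy as np

        n_layers, n_columns = 72, 2274            # vertical levels x columns
        tau = np.random.rand(n_layers, n_columns).astype(np.float32)

        def solar_kernel(tau_block):
            # stand-in per-column physics: cumulative attenuation down each column
            return np.exp(-np.cumsum(tau_block, axis=0))

        # process columns in groups of 4, mirroring the manual 4-wide SIMDization
        # on the SPEs; the final short group is handled by slice truncation
        flux = np.empty_like(tau)
        for c in range(0, n_columns, 4):
            flux[:, c:c+4] = solar_kernel(tau[:, c:c+4])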

  18. Advanced data management system architectures testbed

    NASA Technical Reports Server (NTRS)

    Grant, Terry

    1990-01-01

    The objective of the Architecture and Tools Testbed is to provide a working, experimental focus to the evolving automation applications for the Space Station Freedom data management system. Emphasis is on defining and refining real-world applications including the following: the validation of user needs; understanding system requirements and capabilities; and extending capabilities. The approach is to provide an open, distributed system of high performance workstations representing both the standard data processors and networks and advanced RISC-based processors and multiprocessor systems. The system provides a base from which to develop and evaluate new performance and risk management concepts and for sharing the results. Participants are given a common view of requirements and capability via: remote login to the testbed; standard, natural user interfaces to simulations and emulations; special attention to user manuals for all software tools; and E-mail communication. The testbed elements which instantiate the approach are briefly described including the workstations, the software simulation and monitoring tools, and performance and fault tolerance experiments.

  19. A digital retina-like low-level vision processor.

    PubMed

    Mertoguno, S; Bourbakis, N G

    2003-01-01

    This correspondence presents the basic design and the simulation of a low level multilayer vision processor that emulates to some degree the functional behavior of a human retina. This retina-like multilayer processor is the lower part of an autonomous self-organized vision system, called Kydon, that could be used on visually impaired people with a damaged visual cerebral cortex. The Kydon vision system, however, is not presented in this paper. The retina-like processor consists of four major layers, where each of them is an array processor based on hexagonal, autonomous processing elements that perform a certain set of low level vision tasks, such as smoothing and light adaptation, edge detection, segmentation, line recognition and region-graph generation. At each layer, the array processor is a 2D array of k×m hexagonal identical autonomous cells that simultaneously execute certain low level vision tasks. Thus, the hardware design and the simulation at the transistor level of the processing elements (PEs) of the retina-like processor and its simulated functionality with illustrative examples are provided in this paper.

  20. Support for Diagnosis of Custom Computer Hardware

    NASA Technical Reports Server (NTRS)

    Molock, Dwaine S.

    2008-01-01

    The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and to configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects and can be customized for use with different processors and interfaces. The routines are supported regardless of the architecture of the processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. The software also supports Mongoose V, RAD750, and PPC405 processors or their equivalents.

  1. Design and implementation of projects with Xilinx Zynq FPGA: a practical case

    NASA Astrophysics Data System (ADS)

    Travaglini, R.; D'Antone, I.; Meneghini, S.; Rignanese, L.; Zuffa, M.

    The main advantage of using FPGAs with embedded processors is the availability of several additional high-performance resources in the same physical device. Moreover, FPGA programmability allows custom peripherals to be connected. Xilinx has designed a programmable device named Zynq-7000 (simply called Zynq in the following), which integrates programmable logic (identical to that of the other Xilinx "series 7" devices) with a System on Chip (SOC) based on two embedded ARM processors. Since the two parts are deeply connected, designers benefit from both the performance of the hardware SOC and the flexibility of programmable logic. In this paper, a design developed by the Electronic Design Department at the Bologna Division of INFN is presented as a practical case of a project based on the Zynq device. It is developed using a commercial board called ZedBoard hosting an FMC mezzanine with a 12-bit 500 MS/s ADC. The Zynq FPGA on the ZedBoard receives digital outputs from the ADC and sends them to the acquisition PC, after proper formatting, through a Gigabit Ethernet link. The major focus of the paper is the methodology for developing a Zynq-based design with the Xilinx Vivado software, showing how to configure the SOC and connect it to the programmable logic. Firmware design techniques are presented; in particular, both VHDL and IP-core-based strategies are discussed. Further, the procedure for developing software for the embedded processor is presented. Finally, some debugging tools, such as the embedded Logic Analyzer, are shown. Advantages and disadvantages with respect to adopting FPGAs without embedded processors are discussed.

  2. Solving the Cauchy-Riemann equations on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; the FLEX/32, an MIMD machine with 20 processors; and the CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures, and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
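
    The record does not give the relaxation scheme itself, so the sketch below uses a generic point-Jacobi relaxation (here for Laplace's equation) as a stand-in: every interior point updates independently from its neighbours' previous values, which is exactly the structure that maps onto SIMD, MIMD, and vector machines alike.

        import numpy as np

        n, sweeps = 64, 500
        u = np.zeros((n, n))
        u[0, :] = 1.0                    # a fixed boundary condition on one edge

        for _ in range(sweeps):
            # one fully data-parallel sweep: all interior points updated at once
            u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:])

        print(f"value at the grid centre after {sweeps} sweeps: {u[n//2, n//2]:.4f}")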

  3. Landsat image registration for agricultural applications

    NASA Technical Reports Server (NTRS)

    Wolfe, R. H., Jr.; Juday, R. D.; Wacker, A. G.; Kaneko, T.

    1982-01-01

    An image registration system has been developed at the NASA Johnson Space Center (JSC) to spatially align multi-temporal Landsat acquisitions for use in agriculture and forestry research. Working in conjunction with the Master Data Processor (MDP) at the Goddard Space Flight Center, it functionally replaces the long-standing LACIE Registration Processor as JSC's data supplier. The system represents an expansion of the techniques developed for the MDP and LACIE Registration Processor, and it utilizes the experience gained in an IBM/JSC effort evaluating the performance of the latter. These techniques are discussed in detail. Several tests were developed to evaluate the registration performance of the system. The results indicate that 1/15-pixel accuracy (about 4m for Landsat MSS) is achievable in ideal circumstances, sub-pixel accuracy (often to 0.2 pixel or better) was attained on a representative set of U.S. acquisitions, and a success rate commensurate with the LACIE Registration Processor was realized. The system has been employed in a production mode on U.S. and foreign data, and a performance similar to the earlier tests has been noted.
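
    The record does not name the correlation method, so the sketch below uses phase correlation as a stand-in: a standard way to estimate the translational offset between two acquisitions, which with subpixel peak interpolation added reaches fractional-pixel accuracies of the kind quoted above.

        import numpy as np

        ref = np.random.rand(128, 128)
        shifted = np.roll(ref, (3, 7), axis=(0, 1))   # a known 3x7-pixel offset

        F1, F2 = np.fft.fft2(ref), np.fft.fft2(shifted)
        cross = np.conj(F1) * F2
        corr = np.fft.ifft2(cross / np.abs(cross)).real   # normalized cross-power

        peak = np.unravel_index(np.argmax(corr), corr.shape)
        print(peak)    # (3, 7): the estimated integer-pixel shift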

  4. A SOPC-BASED Evaluation of AES for 2.4 GHz Wireless Network

    NASA Astrophysics Data System (ADS)

    Ken, Cai; Xiaoying, Liang

    In modern systems, data security is needed more than ever before, and many cryptographic algorithms are utilized for security services. Wireless Sensor Networks (WSN) are an example of such technologies. In this paper, an innovative SOPC-based approach for evaluating security services in WSN is proposed that addresses the issues of scalability, flexible performance, and silicon efficiency for the hardware acceleration of an encryption system. The design includes a Nios II processor together with custom-designed modules for the Advanced Encryption Standard (AES), which has become the default choice for various security services in numerous applications. The objective of this mechanism is to present an efficient hardware realization of AES using a very high speed integrated circuit hardware description language (Verilog HDL) and to expand its usability for various applications. Compared to a traditional custom processor design, the mechanism provides a very broad range of cost/performance points.
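
    A software reference model of the kind used to check such a hardware AES core might look like the sketch below (assuming the Python cryptography package; the key and plaintext are arbitrary): the FPGA module's ciphertext for each 128-bit block should match the library output bit for bit.

        from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

        key = bytes(range(16))                 # a 128-bit AES key, test-vector style
        block = b"WSN sensor data!"            # exactly one 16-byte block

        enc = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
        expected = enc.update(block) + enc.finalize()

        print(expected.hex())                  # compare against the FPGA core's output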

  5. Knowledge-based vision for space station object motion detection, recognition, and tracking

    NASA Technical Reports Server (NTRS)

    Symosek, P.; Panda, D.; Yalamanchili, S.; Wehner, W., III

    1987-01-01

    Computer vision, especially color image analysis and understanding, has much to offer in the area of the automation of Space Station tasks such as construction, satellite servicing, rendezvous and proximity operations, inspection, experiment monitoring, data management and training. Knowledge-based techniques improve the performance of vision algorithms for unstructured environments because of their ability to deal with imprecise a priori information or inaccurately estimated feature data and still produce useful results. Conventional techniques using statistical and purely model-based approaches lack flexibility in dealing with the variabilities anticipated in the unstructured viewing environment of space. Algorithms developed under NASA sponsorship for Space Station applications to demonstrate the value of a hypothesized architecture for a Video Image Processor (VIP) are presented. Approaches to the enhancement of the performance of these algorithms with knowledge-based techniques and the potential for deployment of highly-parallel multi-processor systems for these algorithms are discussed.

  6. Real time processor for array speckle interferometry

    NASA Astrophysics Data System (ADS)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-02-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.

  7. Real time processor for array speckle interferometry

    NASA Technical Reports Server (NTRS)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-01-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.

  8. The emerging conceptualization of groups as information processors.

    PubMed

    Hinsz, V B; Tindale, R S; Vollrath, D A

    1997-01-01

    A selective review of research highlights the emerging view of groups as information processors. In this review, the authors include research on processing objectives, attention, encoding, storage, retrieval, processing, response, feedback, and learning in small interacting task groups. The groups as information processors perspective underscores several characteristic dimensions of variability in group performance of cognitive tasks, namely, commonality-uniqueness of information, convergence-diversity of ideas, accentuation-attenuation of cognitive processes, and belongingness-distinctiveness of members. A combination of contributions framework provides an additional conceptualization of information processing in groups. The authors also address implications, caveats, and questions for future research and theory regarding groups as information processors.

  9. Image Processor Electronics (IPE): The High-Performance Computing System for NASA SWIFT Mission

    NASA Technical Reports Server (NTRS)

    Nguyen, Quang H.; Settles, Beverly A.

    2003-01-01

    Gamma Ray Bursts (GRBs) are believed to be the most powerful explosions that have occurred in the Universe since the Big Bang and are a mystery to the scientific community. Swift, a NASA mission that includes international participation, was designed and built in preparation for a 2003 launch to help determine the origin of Gamma Ray Bursts. Locating the position in the sky where a burst originates requires intensive computing, because the duration of a GRB can range between a few milliseconds and approximately a minute. The instrument data system must constantly accept multiple images representing large regions of the sky that are generated by sixteen gamma ray detectors operating in parallel. It then must process the received images very quickly in order to determine the existence of possible gamma ray bursts and their locations. The high-performance instrument data computing system that accomplishes this is called the Image Processor Electronics (IPE). The IPE was designed, built and tested by NASA Goddard Space Flight Center (GSFC) in order to meet these challenging requirements. The IPE is a small-size, low-power, high-performance computing system for space applications. This paper addresses the system implementation and the system hardware architecture of the IPE. The paper concludes with the IPE system performance that was measured during end-to-end system testing.

  10. Robust, High-Speed Network Design for Large-Scale Multiprocessing

    DTIC Science & Technology

    1993-09-01

    (Abstract not available: the record excerpt preserves only list-of-figures fragments, including non-expansive and expansive wiring of processors to first-stage routing elements, RNI micro-architecture, packaged RNI IC, MLUNK message formats, and a routing-board arrangement for a 64-processor machine.)

  11. Evaluation of the Sentinel-3 Hydrologic Altimetry Processor prototypE (SHAPE) methods.

    NASA Astrophysics Data System (ADS)

    Benveniste, J.; Garcia-Mondéjar, A.; Bercher, N.; Fabry, P. L.; Roca, M.; Varona, E.; Fernandes, J.; Lazaro, C.; Vieira, T.; David, G.; Restano, M.; Ambrózio, A.

    2017-12-01

    Inland water scenes are highly variable, both in space and time, which leads to a much broader range of radar signatures than ocean surfaces. This applies to both LRM and "SAR" mode (SARM) altimetry. Nevertheless, the enhanced along-track resolution of SARM altimeters should help improve the accuracy and precision of inland water height measurements from satellite. The SHAPE project (Sentinel-3 Hydrologic Altimetry Processor prototypE), which is funded by ESA through the Scientific Exploitation of Operational Missions Programme Element (contract number 4000115205/15/I-BG), aims at preparing for the exploitation of Sentinel-3 data over the inland water domain. The SHAPE processor implements all of the steps necessary to derive river and lake water levels and discharge from Delay-Doppler altimetry and to validate them against in situ data. The processor uses FBR CryoSat-2 and L1A Sentinel-3A data as input, along with various ancillary data (processing parameters, water masks, L2 corrections, etc.), to produce surface water levels. At a later stage, water level data are assimilated into hydrological models to derive river discharge. This poster presents the improvements obtained with the new methods and algorithms over the regions of interest (the Amazon and Danube rivers, and lakes Vanern and Titicaca).

  12. Domain Wall Fermion Inverter on Pentium 4

    NASA Astrophysics Data System (ADS)

    Pochinsky, Andrew

    2005-03-01

    A highly optimized domain wall fermion inverter has been developed as part of the SciDAC lattice initiative. By designing the code to minimize memory bus traffic, it achieves high cache reuse and performance in excess of 2 GFlops for problem sizes that exceed the L2 cache, on a GigE cluster with 2.66 GHz Xeon processors. The code uses the SciDAC QMP communication library.

  13. Baseband processor development for the Advanced Communications Satellite Program

    NASA Technical Reports Server (NTRS)

    Moat, D.; Sabourin, D.; Stilwell, J.; Mccallister, R.; Borota, M.

    1982-01-01

    An onboard-baseband-processor concept for a satellite-switched time-division-multiple-access (SS-TDMA) communication system was developed for NASA Lewis Research Center. The baseband processor routes and controls traffic on an individual message basis while providing significant advantages in improved link margins and system flexibility. Key technology developments required to prove the flight readiness of the baseband-processor design are being verified in a baseband-processor proof-of-concept model. These technology developments include serial MSK modems, a Clos-type baseband routing switch, a single-chip CMOS maximum-likelihood convolutional decoder, and custom LSI implementations of high-speed, low-power ECL building blocks.

  14. Applications Performance on NAS Intel Paragon XP/S - 15#

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

    1994-01-01

    The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February 1993. The i860 XP microprocessor, with an integrated floating-point unit operating in dual-instruction mode, gives a peak performance of 75 million floating-point operations per second (MFLOPS) for 64-bit floating-point arithmetic. It is used in the Paragon XP/S-15, which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and a peak performance of 15.6 GFLOPS. Here we report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3, both assembly-coded and Fortran-coded, on the NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single-node one-dimensional FFT, a distributed two-dimensional FFT, and a distributed three-dimensional FFT. Finally, we measured the performance of the NAS Parallel Benchmarks (NPB) on the Paragon and compared it with the performance obtained on other highly parallel machines, such as the CM-5, CRAY T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default the OSF/1 AD operating system from the Open Software Foundation (OSF). Paging of the OSF server, at 22 MB, to make more memory available for the application degrades performance. We found that when the limit of 26 MB per node, out of the 32 MB available, is reached, the application is paged out of main memory using virtual memory. When the application starts paging, performance is considerably reduced. We found that dynamic memory allocation can help application performance under certain circumstances. b. Impact of the data cache on the i860 XP: We measured the performance of the BLAS, both assembly-coded and Fortran-coded. We found that the measured performance of the assembly-coded BLAS is much less than what the memory bandwidth limitation would predict. The influence of the data cache on different vector sizes is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processor layout.

  15. Design, characterization and control of the Unique Mobility Corporation robot

    NASA Technical Reports Server (NTRS)

    Velasco, Virgilio B., Jr.; Newman, Wyatt S.; Steinetz, Bruce; Kopf, Carlo; Malik, John

    1994-01-01

    Space and mass are at a premium on any space mission, and thus any machinery designed for space use should be lightweight and compact, without sacrificing strength. It is for this reason that NASA/LeRC contracted Unique Mobility Corporation to exploit their novel actuator designs to build a robot that would advance the present state of technology with respect to these requirements. Custom-designed motors are the key feature of this robot. They are compact, high-performance dc brushless servo motors with a high pole count and low inductance, thus permitting high torque generation and rapid phase commutation. Using a custom-designed digital signal processor-based controller board, the pulse width modulation power amplifiers regulate the fast dynamics of the motor currents. In addition, the programmable digital signal processor (DSP) controller permits implementation of nonlinear compensation algorithms to account for motoring vs. regeneration, torque ripple, and back-EMF. As a result, the motors produce a high torque relative to their size and weight, and can do so with good torque regulation and acceptably high velocity saturation limits. This paper presents the Unique Mobility Corporation robot prototype: its actuators, its kinematic design, its control system, and its experimental characterization. Performance results, including saturation torques, saturation velocities and tracking accuracy tests are included.

  16. Binaural Speech Understanding With Bilateral Cochlear Implants in Reverberation.

    PubMed

    Kokkinakis, Kostas

    2018-03-08

    The purpose of this study was to investigate whether bilateral cochlear implant (CI) listeners who are fitted with clinical processors are able to benefit from binaural advantages under reverberant conditions. Another aim of this contribution was to determine whether the magnitude of each binaural advantage observed inside a highly reverberant environment differs significantly from the magnitude measured in a near-anechoic environment. Ten adults with postlingual deafness who are bilateral CI users fitted with either Nucleus 5 or Nucleus 6 clinical sound processors (Cochlear Corporation) participated in this study. Speech reception thresholds were measured in sound field and 2 different reverberation conditions (0.06 and 0.6 s) as a function of the listening condition (left, right, both) and the noise spatial location (left, front, right). The presence of the binaural effects of head-shadow, squelch, summation, and spatial release from masking in the 2 different reverberation conditions tested was determined using nonparametric statistical analysis. In the bilateral population tested, when the ambient reverberation time was equal to 0.6 s, results indicated strong positive effects of head-shadow and a weaker spatial release from masking advantage, whereas binaural squelch and summation contributed no statistically significant benefit to bilateral performance under this acoustic condition. These findings are consistent with those of previous studies, which have demonstrated that head-shadow yields the most pronounced advantage in noise. The finding that spatial release from masking produced little to almost no benefit in bilateral listeners is consistent with the hypothesis that additive reverberation degrades spatial cues and negatively affects binaural performance. The magnitude of 4 different binaural advantages was measured on the same group of bilateral CI subjects fitted with clinical processors in 2 different reverberation conditions. The results of this work demonstrate the impeding properties of reverberation on binaural speech understanding. In addition, results indicate that CI recipients who struggle in everyday listening environments are also more likely to benefit less in highly reverberant environments from their bilateral processors.

  17. High Performance Computing at NASA

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Cooper, D. M. (Technical Monitor)

    1994-01-01

    The speaker will give an overview of high performance computing in the U.S. in general and within NASA in particular, including a description of the recently signed NASA-IBM cooperative agreement. The latest performance figures of various parallel systems on the NAS Parallel Benchmarks will be presented. The speaker was one of the authors of the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks, which are now widely cited in the industry as a measure of sustained performance on realistic high-end scientific applications. It will be shown that significant progress has been made by the highly parallel supercomputer industry during the past year or so, with several new systems, based on high-performance RISC processors, that now deliver superior performance per dollar compared to conventional supercomputers. Various pitfalls in reporting performance will be discussed. The speaker will then conclude by assessing the general state of the high performance computing field.

  18. Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

    PubMed Central

    2010-01-01

    Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient alternative to conventional hardware to solve computational problems in image processing and bioinformatics. PMID:20064262
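
    As a concrete illustration of two of the optimisation techniques named above, the sketch below unrolls a dot-product loop by four and keeps the remainder loop branch-minimal. This is a minimal sketch under assumed names and an assumed unroll factor, not the project's code.

        /* Minimal sketch (illustrative): loop unrolling and branch
         * reduction of a dot product. Unrolling by 4 exposes independent
         * multiply-adds to SIMD-style pipelines such as the CBE's SPEs. */
        float dot_unrolled(const float *a, const float *b, int n) {
            float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
            int i;
            for (i = 0; i + 3 < n; i += 4) {   /* unrolled main loop */
                s0 += a[i]     * b[i];
                s1 += a[i + 1] * b[i + 1];
                s2 += a[i + 2] * b[i + 2];
                s3 += a[i + 3] * b[i + 3];
            }
            for (; i < n; i++)                 /* short remainder loop */
                s0 += a[i] * b[i];
            return (s0 + s1) + (s2 + s3);
        }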

  19. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
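
    A minimal sketch of the kind of OpenMP loop-level parallelism described above for multi-core CPUs and MIC processors, assuming a precomputed neighbor list; the particle counts, kernel function, and array names are hypothetical, not taken from the paper.

        /* Minimal sketch: OpenMP-parallel SPH density summation over a
         * precomputed neighbor list. Compile with a -fopenmp-style flag. */
        #include <math.h>
        #include <omp.h>

        #define N 10000      /* number of particles (assumed) */
        #define MAXNB 64     /* max neighbors per particle (assumed) */

        static double w_kernel(double r, double h) {
            /* simplified compact-support kernel, for illustration only */
            double q = r / h;
            return (q < 1.0) ? (1.0 - q) * (1.0 - q) * (1.0 - q) : 0.0;
        }

        void sph_density(const double x[N][3], const int nb[N][MAXNB],
                         const int nnb[N], double mass, double h,
                         double rho[N]) {
            /* Each thread owns a block of particles; iteration i writes
             * only rho[i], so no synchronization is needed. */
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int k = 0; k < nnb[i]; k++) {
                    int j = nb[i][k];
                    double dx = x[i][0] - x[j][0];
                    double dy = x[i][1] - x[j][1];
                    double dz = x[i][2] - x[j][2];
                    sum += mass * w_kernel(sqrt(dx*dx + dy*dy + dz*dz), h);
                }
                rho[i] = sum;
            }
        }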

  20. Scan line graphics generation on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1988-01-01

    Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. Performing the pixel value calculations, balancing the load across the processors, and applying the results to the Z buffer efficiently in parallel require special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.
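
    For orientation, the sketch below shows the per-pixel Z-buffer test that such a renderer must apply to each batch of computed pixels; on a SIMD machine the loop body would run across many processors at once after pixels are routed to the owners of their (x, y) locations. The names and dimensions are illustrative assumptions, not the MPP code.

        /* Minimal sketch: applying a batch of computed pixels to the
         * Z buffer, keeping the nearest sample at each location. */
        #define W 640
        #define H 480

        typedef struct { int x, y; float z; unsigned color; } Pixel;

        void zbuffer_apply(float depth[H][W], unsigned frame[H][W],
                           const Pixel *batch, int n) {
            for (int i = 0; i < n; i++) {
                const Pixel *p = &batch[i];
                if (p->z < depth[p->y][p->x]) {  /* nearer than stored? */
                    depth[p->y][p->x] = p->z;
                    frame[p->y][p->x] = p->color;
                }
            }
        }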

  1. Lenslet array processors.

    PubMed

    Glaser, I

    1982-04-01

    By combining a lenslet array with masks it is possible to obtain a noncoherent optical processor capable of computing in parallel generalized 2-D discrete linear transformations. We present here an analysis of such lenslet array processors (LAP). The effects of several errors, including optical aberrations, diffraction, vignetting, and geometrical and mask errors, are calculated, and guidelines for the optical design of LAP are derived. Using these results, both ultimate and practical performances of LAP are compared with those of competing techniques.

  2. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system comprises a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  3. A diesel fuel processor for fuel-cell-based auxiliary power unit applications

    NASA Astrophysics Data System (ADS)

    Samsun, Remzi Can; Krekel, Daniel; Pasel, Joachim; Prawitz, Matthias; Peters, Ralf; Stolten, Detlef

    2017-07-01

    Producing a hydrogen-rich gas from diesel fuel enables the efficient generation of electricity in a fuel-cell-based auxiliary power unit. In recent years, significant progress has been achieved in diesel reforming. One issue encountered is the stable operation of water-gas shift reactors with real reformates. A new fuel processor is developed using a commercial shift catalyst. The system is operated using optimized start-up and shut-down strategies. Experiments with diesel and kerosene fuels show slight performance drops in the shift reactor during continuous operation for 100 h. CO concentrations much lower than the target value are achieved during system operation in auxiliary power unit mode at partial loads of up to 60%. The regeneration leads to full recovery of the shift activity. Finally, a new operation strategy is developed whereby the gas hourly space velocity of the shift stages is re-designed. This strategy is validated using different diesel and kerosene fuels, showing a maximum CO concentration of 1.5% at the fuel processor outlet under extreme conditions, which can be tolerated by a high-temperature PEFC. The proposed operation strategy solves the issue of strong performance drop in the shift reactor and makes this technology available for reducing emissions in the transportation sector.

  4. Fast and robust control of nanopositioning systems: Performance limits enabled by field programmable analog arrays.

    PubMed

    Baranwal, Mayank; Gorugantu, Ram S; Salapaka, Srinivasa M

    2015-08-01

    This paper aims at control design and its implementation for robust high-bandwidth precision (nanoscale) positioning systems. Even though modern model-based control theoretic designs for robust broadband high-resolution positioning have enabled orders of magnitude improvement in performance over existing model independent designs, their scope is severely limited by the inefficacies of digital implementation of the control designs. High-order control laws that result from model-based designs typically have to be approximated with reduced-order systems to facilitate digital implementation. Digital systems, even those that have very high sampling frequencies, provide low effective control bandwidth when implementing high-order systems. In this context, field programmable analog arrays (FPAAs) provide a good alternative to digital-logic based processors, since they enable very high implementation speeds with cheaper resources. The superior flexibility of digital systems in terms of the implementable mathematical and logical functions does not give them a significant edge over FPAAs when implementing linear dynamic control laws. In this paper, we pose the control design objectives for positioning systems in different configurations as optimal control problems and demonstrate significant improvements in performance when the resulting control laws are applied using FPAAs as opposed to their digital counterparts. An improvement of over 200% in positioning bandwidth is achieved over an earlier digital signal processor (DSP) based implementation for the same system and the same control design, even though, for the DSP-based system, the sampling frequency is about 100 times the desired positioning bandwidth.

  5. Potential of minicomputer/array-processor system for nonlinear finite-element analysis

    NASA Technical Reports Server (NTRS)

    Strohkorb, G. A.; Noor, A. K.

    1983-01-01

    The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.

  6. Vascular system modeling in parallel environment - distributed and shared memory approaches

    PubMed Central

    Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

    2011-01-01

    The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891

  7. Parallel matrix multiplication on the Connection Machine

    NASA Technical Reports Server (NTRS)

    Tichy, Walter F.

    1988-01-01

    Matrix multiplication is a computation and communication intensive problem. Six parallel algorithms for matrix multiplication on the Connection Machine are presented and compared with respect to their performance and processor usage. For n by n matrices, the algorithms have theoretical running times of O(n^2 log n), O(n log n), O(n), and O(log n), and require n, n^2, n^2, and n^3 processors, respectively. With careful attention to communication patterns, the theoretically predicted runtimes can indeed be achieved in practice. The parallel algorithms illustrate the tradeoffs between performance, communication cost, and processor usage.
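
    A minimal sketch of one point in that tradeoff space: assigning one logical processor to each output element C[i][j] gives the O(n)-time, n^2-processor variant, since each processor performs one inner product. This is not one of the paper's six Connection Machine algorithms; here OpenMP threads merely stand in for CM processors.

        /* Minimal sketch: data-parallel matrix multiply with one
         * independent O(n) inner product per output element. */
        #define N 256

        void matmul(const double A[N][N], const double B[N][N],
                    double C[N][N]) {
            #pragma omp parallel for collapse(2) schedule(static)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;       /* private to this (i, j) */
                    for (int k = 0; k < N; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
        }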

  8. Error recovery in shared memory multiprocessors using private caches

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

    1990-01-01

    The problem of recovering from processor transient faults in shared memory multiprocesses systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.

  9. Multiprocessor switch with selective pairing

    DOEpatents

    Gara, Alan; Gschwind, Michael K; Salapura, Valentina

    2014-03-11

    System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each pair of microprocessor or processor cores that provides one highly reliable thread connects with system components such as a memory "nest" (or memory hierarchy), an optional system controller, an optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to the selective pairing facility via a switch or a bus.

  10. Optimization strategies for molecular dynamics programs on Cray computers and scalar work stations

    NASA Astrophysics Data System (ADS)

    Unekis, Michael J.; Rice, Betsy M.

    1994-12-01

    We present results of timing runs and different optimization strategies for a prototype molecular dynamics program that simulates shock waves in a two-dimensional (2-D) model of a reactive energetic solid. The performance of the program may be improved substantially by simple changes to the Fortran or by employing various vendor-supplied compiler optimizations. The optimum strategy varies among the machines used and will vary depending upon the details of the program. The effect of various compiler options and vendor-supplied subroutine calls is demonstrated. Comparison is made between two scalar workstations (IBM RS/6000 Model 370 and Model 530) and several Cray supercomputers (X-MP/48, Y-MP8/128, and C-90/16256). We find that for a scientific application program dominated by sequential, scalar statements, a relatively inexpensive high-end workstation such as the IBM RS/6000 RISC series will outperform single-processor performance of the Cray X-MP/48 and perform competitively with single-processor performance of the Y-MP8/128 and C-90/16256.

  11. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by a store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch, rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.

  12. Low latency memory access and synchronization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by a store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching scheme for non-contiguous data structures is also disclosed. A memory line is redefined so that, in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch, rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
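
    A minimal software sketch of the pointer-per-line prefetching idea in these two records, assuming a GCC/Clang environment for __builtin_prefetch; the line layout and field names are illustrative, and the patented hardware follows the stored pointer itself rather than relying on software hints.

        /* Minimal sketch: each memory line carries a pointer naming the
         * next line to fetch, so a walker over a non-contiguous but
         * repetitive pattern can prefetch one step ahead of use. */
        #include <stddef.h>

        typedef struct MemLine {
            char payload[56];        /* normal physical memory data */
            struct MemLine *next;    /* per-line pointer: prefetch target */
        } MemLine;

        long walk(MemLine *line) {
            long sum = 0;
            while (line != NULL) {
                if (line->next)      /* hint the next line before using this one */
                    __builtin_prefetch(line->next, 0 /*read*/, 1 /*low locality*/);
                sum += line->payload[0];   /* consume the current line */
                line = line->next;
            }
            return sum;
        }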

  13. Parallel processor for real-time structural control

    NASA Astrophysics Data System (ADS)

    Tise, Bert L.

    1993-07-01

    A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-to-output latency, a 240 Mbyte/s synchronous backplane bus, a low-skew clock distribution circuit, a VME connection to the host computer, a parallelizing code generator, and look-up tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure, and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An OpenWindows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.

  14. Variations and relationship of quality and NIR spectral characteristics of cotton fibers collected from multi-location field performance trials

    USDA-ARS?s Scientific Manuscript database

    High volume instrumentation (HVI™) and advanced fiber information system (AFIS) measurements are increasingly being utilized as primary and routine means of acquiring fiber quality data by cotton breeders and fiber processors. There is amount of information regarding fiber and yarn qualities, but l...

  15. Real-time support for high performance aircraft operation

    NASA Technical Reports Server (NTRS)

    Vidal, Jacques J.

    1989-01-01

    The feasibility of real-time processing schemes using artificial neural networks (ANNs) is investigated. A rationale for digital neural nets is presented and a general processor architecture for control applications is illustrated. Research results on ANN structures for real-time applications are given. Research results on ANN algorithms for real-time control are also shown.

  16. Hardware for Accelerating N-Modular Redundant Systems for High-Reliability Computing

    NASA Technical Reports Server (NTRS)

    Dobbs, Carl, Sr.

    2012-01-01

    A hardware unit has been designed that reduces the cost, in terms of performance and power consumption, of implementing N-modular redundancy (NMR) in a multiprocessor device. The innovation monitors transactions to memory and calculates a form of sumcheck on-the-fly, thereby relieving the processors of calculating the sumcheck in software.
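
    A minimal sketch of the sumcheck-and-vote idea, assuming triple-modular redundancy; the checksum fold and the voting rule are illustrative stand-ins for the hardware unit, not the patented design.

        /* Minimal sketch: accumulate a checksum over each redundant
         * module's memory transactions, then majority-vote the sums. */
        #include <stdint.h>

        #define NMOD 3   /* triple-modular redundancy (assumed) */

        /* Fold one memory transaction (address, data) into a running
         * checksum; the rotate makes transaction order matter. */
        static uint32_t fold(uint32_t sum, uint32_t addr, uint32_t data) {
            sum ^= addr;
            sum = (sum << 1) | (sum >> 31);
            return sum ^ data;
        }

        /* Return the index of a module whose checksum disagrees with the
         * majority, or -1 if all modules agree. */
        int vote(const uint32_t sum[NMOD]) {
            for (int i = 0; i < NMOD; i++) {
                int agree = 0;
                for (int j = 0; j < NMOD; j++)
                    if (sum[j] == sum[i]) agree++;
                if (2 * agree <= NMOD) return i;  /* minority -> suspect */
            }
            return -1;
        }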

  17. Video signal processing system uses gated current mode switches to perform high speed multiplication and digital-to-analog conversion

    NASA Technical Reports Server (NTRS)

    Gilliland, M. G.; Rougelot, R. S.; Schumaker, R. A.

    1966-01-01

    A video signal processor uses special-purpose integrated circuits with nonsaturating current-mode switching to accept texture and color information from a digital computer in a visual spaceflight simulator, and to combine these with analog information concerning fading for display on a color CRT.

  18. Implementation of a level 1 trigger system using high speed serial (VXS) techniques for the 12GeV high luminosity experimental programs at Thomas Jefferson National Accelerator Facility

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    C. Cuevas, B. Raydo, H. Dong, A. Gupta, F.J. Barbosa, J. Wilson, W.M. Taylor, E. Jastrzembski, D. Abbott

    We will demonstrate a hardware and firmware solution for a complete fully pipelined multi-crate trigger system that takes advantage of the elegant high speed VXS serial extensions for VME. This trigger system includes three sections, starting with the front end crate trigger processor (CTP), a global Sub-System Processor (SSP) and a Trigger Supervisor that manages the timing, synchronization and front end event readout. Within a front end crate, trigger information is gathered from each 16-channel, 12-bit Flash ADC module at 4 ns intervals via the VXS backplane, to a Crate Trigger Processor (CTP). Each Crate Trigger Processor receives these 500 MB/s VXS links from the 16 FADC-250 modules, aligns the skewed data inherent to the Aurora protocol, and performs real time crate level trigger algorithms. The algorithm results are encoded using a Reed-Solomon technique, and this Level 1 trigger data is transmitted to the SSP over a multi-fiber link. The multi-fiber link achieves an aggregate trigger data transfer rate to the global trigger of 8 Gb/s. The SSP receives and decodes the Reed-Solomon error correcting transmission from each crate, aligns the data, and performs the global level trigger algorithms. The entire trigger system is synchronous and operates at 250 MHz, with the Trigger Supervisor managing not only the front end event readout but also the distribution of the critical timing clocks, synchronization signals, and global trigger signals to each front end readout crate. These signals are distributed to the front end crates on a separate fiber link, and each crate is synchronized using a unique encoding scheme to guarantee that each front end crate is synchronous with a fixed latency, independent of the distance between crates. The overall trigger signal latency is <3 us, and the proposed 12 GeV experiments at Jefferson Lab require up to a 200 kHz Level 1 trigger rate.

  19. A Simple and Affordable TTL Processor for the Classroom

    ERIC Educational Resources Information Center

    Feinberg, Dave

    2007-01-01

    This paper presents a simple 4 bit computer processor design that may be built using TTL chips for less than $65. In addition to describing the processor itself in detail, we discuss our experience using the laboratory kit and its associated machine instruction set to teach computer architecture to high school students. (Contains 3 figures and 5…

  20. Benefit of the UltraZoom beamforming technology in noise in cochlear implant users.

    PubMed

    Mosnier, Isabelle; Mathias, Nathalie; Flament, Jonathan; Amar, Dorith; Liagre-Callies, Amelie; Borel, Stephanie; Ambert-Dahan, Emmanuèle; Sterkers, Olivier; Bernardeschi, Daniele

    2017-09-01

    The objectives of the study were to demonstrate the audiological and subjective benefits of the adaptive UltraZoom beamforming technology available in the Naída CI Q70 sound processor, in cochlear-implanted adults upgraded from a previous generation sound processor. Thirty-four adults aged between 21 and 89 years (mean 53 ± 19) were prospectively included. Nine subjects were unilaterally implanted, 11 bilaterally and 14 were bimodal users. The mean duration of cochlear implant use was 7 years (range 5-15 years). Subjects were tested in quiet with monosyllabic words and in noise with the adaptive French Matrix test in the best-aided conditions. The test setup contained a signal source in front of the subject and three noise sources at +/-90° and 180°. The noise was presented at a fixed level of 65 dB SPL and the level of speech signal was varied to obtain the speech reception threshold (SRT). During the upgrade visit, subjects were tested with the Harmony and with the Naída CI sound processors in omnidirectional microphone configuration. After a take-home phase of 2 months, tests were repeated with the Naída CI processor with and without UltraZoom. Subjective assessment of the sound quality in daily environments was recorded using the APHAB questionnaire. No difference in performance was observed in quiet between the two processors. The Matrix test in noise was possible in the 21 subjects with the better performance. No difference was observed between the two processors for performance in noise when using the omnidirectional microphone. At the follow-up session, the median SRT with the Naída CI processor with UltraZoom was -4 dB compared to -0.45 dB without UltraZoom. The use of UltraZoom improved the median SRT by 3.6 dB (p < 0.0001, Wilcoxon paired test). When looking at the APHAB outcome, improvement was observed for speech understanding in noisy environments (p < 0.01) and in aversive situations (p < 0.05) in the group of 21 subjects who were able to perform the Matrix test in noise and for speech understanding in noise (p < 0.05) in the group of 13 subjects with the poorest performance, who were not able to perform the Matrix test in noise. The use of UltraZoom beamforming technology, available on the new sound processor Naída CI, improves speech performance in difficult and realistic noisy conditions when the cochlear implant user needs to focus on the person speaking at the front. Using the APHAB questionnaire, a subjective benefit for listening in background noise was also observed in subjects with good performance as well as in those with poor performance. This study highlighted the importance of upgrading CI recipients to new technology and to include assessment in noise and subjective feedback evaluation as part of the process.

  1. Multiprocessor architectural study

    NASA Technical Reports Server (NTRS)

    Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.

    1972-01-01

    An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics' previous experience and accumulated knowledge of the multiprocessor field were used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined, and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance are discussed extensively, with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of the various register designs adopted by current computers and those of advanced design.

  2. Performance Models for Split-execution Computing Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Humble, Travis S; McCaskey, Alex; Schrock, Jonathan

    Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardware limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.

  3. Efficacy of Code Optimization on Cache-Based Processors

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    In this paper a number of techniques for improving the cache performance of a representative piece of numerical software are presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses. But they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
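
    One representative locality transformation of the kind evaluated in such studies is loop blocking (tiling); the sketch below tiles a matrix multiply so each tile's working set stays in cache across the inner loops. The tile size and matrix dimensions are assumptions to be tuned per machine, and this is not code from the paper.

        /* Minimal sketch: blocked (tiled) matrix multiply for cache
         * locality. Assumes N is divisible by B and C starts zeroed. */
        #define N 512
        #define B 64   /* tile size: tune per cache (assumption) */

        void matmul_tiled(const double A[N][N], const double Bm[N][N],
                          double C[N][N]) {
            for (int ii = 0; ii < N; ii += B)
                for (int kk = 0; kk < N; kk += B)
                    for (int jj = 0; jj < N; jj += B)
                        /* tile kernel: BxB operand blocks stay resident */
                        for (int i = ii; i < ii + B; i++)
                            for (int k = kk; k < kk + B; k++) {
                                double a = A[i][k];
                                for (int j = jj; j < jj + B; j++)
                                    C[i][j] += a * Bm[k][j];
                            }
        }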

  4. Performance of VPIC on Sequoia

    NASA Astrophysics Data System (ADS)

    Nystrom, William

    2014-10-01

    Sequoia is a major DOE computing resource that is characteristic of future resources: it has many hardware threads per compute node (64), and its individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene/Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on the single-node performance of VPIC as well as multi-node scaling.

  5. Efficient parallel architecture for highly coupled real-time linear system applications

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo

    1988-01-01

    A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.

  6. Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

    NASA Astrophysics Data System (ADS)

    Debes, Eric; Kaine, Greg

    2002-11-01

    In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed, highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 times faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using the Intel® VTune™ performance analyzer. The experimental measurements are compared to theoretical results.

  7. Implicit, nonswitching, vector-oriented algorithm for steady transonic flow

    NASA Technical Reports Server (NTRS)

    Lottati, I.

    1983-01-01

    A rapid computation of a sequence of transonic flow solutions has to be performed in many areas of aerodynamic technology. The employment of low-cost vector array processors makes such calculations economically feasible. However, to fully utilize the new hardware, the developed algorithms must take advantage of the special characteristics of the vector array processor. The objective of the present investigation is to develop an efficient algorithm for solving transonic flow problems governed by mixed partial differential equations on an array processor.

  8. Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions

    DTIC Science & Technology

    2014-05-01

    ...processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general... deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA... algorithms, including IBM's Cell/B.E., GPUs from NVidia and AMD, and many-core CPUs from Intel. The vast growth of digital video content has been a...

  9. A wideband software reconfigurable modem

    NASA Astrophysics Data System (ADS)

    Turner, J. H., Jr.; Vickers, H.

    A wideband modem is described which provides signal processing capability for four Lx-band signals employing QPSK, MSK and PPM waveforms, and which uses a software-reconfigurable architecture for maximum system flexibility and graceful degradation. The current processor uses a 2901 and two 8086 microprocessors per channel and performs acquisition, tracking, and data demodulation for JTIDS, GPS, IFF and TACAN systems. The next-generation processor will be implemented using a VHSIC chip set employing a programmable complex array vector processor module, a GP computer module, customized gate array modules, and a digital array correlator. This integrated processor is applicable to a wide range of diverse system waveforms, and will bring the benefits of VHSIC technology insertion to antijam avionic communications systems.

  10. An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

    NASA Astrophysics Data System (ADS)

    Liu, Fenglai; Kong, Jing

    2018-07-01

    Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Intel Phi processor are discussed, especially concerning the single-instruction-multiple-data type of processing and the small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as a 12-fold speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.

  11. Performance Study of Earth Networks Total Lightning Network using Rocket-Triggered Lightning Data in 2014

    NASA Astrophysics Data System (ADS)

    Heckman, S.

    2015-12-01

    Modern lightning locating systems (LLS) provide real-time monitoring and early warning of lightning activities. In addition, LLS provide valuable data for statistical analysis in lightning research. It is important to know the performance of such LLS. In the present study, the performance of the Earth Networks Total Lightning Network (ENTLN) is studied using rocket-triggered lightning data acquired at the International Center for Lightning Research and Testing (ICLRT), Camp Blanding, Florida. Eighteen flashes triggered at ICLRT in 2014 were analyzed; they comprise 78 negative cloud-to-ground return strokes. The geometric mean, median, minimum, and maximum of the peak currents of the 78 return strokes are 13.4 kA, 13.6 kA, 3.7 kA, and 38.4 kA, respectively. The peak currents represent typical subsequent return strokes in natural cloud-to-ground lightning. Earth Networks has developed a new data processor to improve the performance of their network. In this study, results are presented for the ENTLN data using the old processor (originally reported in 2014) and the ENTLN data simulated using the new processor. The flash detection efficiency, stroke detection efficiency, percentage of misclassification, median location error, median peak current estimation error, and median absolute peak current estimation error for the originally reported data from the old processor are 100%, 94%, 49%, 271 m, 5%, and 13%, respectively, and those for the simulated data using the new processor are 100%, 99%, 9%, 280 m, 11%, and 15%, respectively. The use of the new processor resulted in a higher stroke detection efficiency and a lower percentage of misclassification. It is worth noting that the slight differences in median location error, median peak current estimation error, and median absolute peak current estimation error between the two processors are due to the fact that the new processor detected more return strokes than the old processor.

  12. Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.

    1994-01-01

    The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
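
    A minimal sketch of the shared-memory-per-beam idea: each downlink beam owns a queue in shared memory, and uplink packets are appended according to their destination field, so outputs never contend. The sizes, packet layout, and beam-id field are illustrative assumptions, not the ISP design.

        /* Minimal sketch: destination-directed packet routing into one
         * shared memory queue per downlink beam (contention-free). */
        #define BEAMS 8
        #define QDEPTH 64
        #define PKT_BYTES 16

        typedef struct {
            unsigned char data[PKT_BYTES];  /* byte 0: destination beam id */
        } Packet;

        typedef struct {
            Packet slot[QDEPTH];
            int head, tail, count;
        } BeamQueue;

        static BeamQueue queue[BEAMS];

        /* Append a packet to its destination beam's queue; returns 0 on
         * success, -1 if that beam's queue is full (packet dropped). */
        int route(const Packet *p) {
            int beam = p->data[0] % BEAMS;
            BeamQueue *q = &queue[beam];
            if (q->count == QDEPTH) return -1;
            q->slot[q->tail] = *p;
            q->tail = (q->tail + 1) % QDEPTH;
            q->count++;
            return 0;
        }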

  13. A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Lim, Chieng-Fai

    1991-01-01

    The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools, such as the SYLON synthesis system (X90), (CM89), (LM90), have been developed based on this method. A parallel implementation of SYLON-XTRANS (XM89) on an eight-processor Encore Multimax shared-memory multiprocessor is presented. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.

  14. Automated Sequence Processor: Something Old, Something New

    NASA Technical Reports Server (NTRS)

    Streiffert, Barbara; Schrock, Mitchell; Fisher, Forest; Himes, Terry

    2012-01-01

    High productivity is required for operations teams to meet schedules, and risk must be minimized. Scripting is used to automate processes, and these scripts perform essential operations functions. The Automated Sequence Processor (ASP) was a grass-roots task built to automate the command uplink process, and a system engineering task for ASP revitalization was organized. ASP is a set of approximately 200 scripts written in Perl, C Shell, AWK and other scripting languages. ASP processes, checks, and packages non-interactive commands automatically. Non-interactive commands are guaranteed to be safe and have been checked by hardware or software simulators. ASP checks that commands are non-interactive, processes them through a command simulator, and then packages them if there are no errors. ASP must be active 24 hours/day, 7 days/week.

  15. A site oriented supercomputer for theoretical physics: The Fermilab Advanced Computer Program Multi Array Processor System (ACPMAPS)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nash, T.; Atac, R.; Cook, A.

    1989-03-06

    The ACPMAPS multiprocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data memory and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL chip set. The system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.

  16. A seismic signal processor suitable for use with the NOAA/GOES satellite data collection system

    NASA Technical Reports Server (NTRS)

    Webster, W. J., Jr.; Miller, W. H.; Whitley, R.; Allenby, R. J.; Dennison, R. T.

    1981-01-01

    Because of the high data-rate requirements, a practical system capable of collecting seismic information in the field and relaying it, via satellite, to a central collection point is not yet available. A seismic signal processor has been developed and tested for use with the NOAA/GOES satellite data collection system. Performance tests on recorded, as well as real time, short period signals indicate that the event recognition technique used is nearly perfect in its rejection of environmental noise and other non-seismic signals and that, with the use of solid state buffer memories, data can be acquired in many swarm situations. The design of a complete field data collection platform is discussed based on the prototype evaluation.

  17. Optoelectronic switch matrix as a look-up table for residue arithmetic.

    PubMed

    Macdonald, R I

    1987-10-01

    The use of optoelectronic matrix switches to perform look-up table functions in residue arithmetic processors is proposed. In this application, switchable detector arrays give the advantage of a greatly reduced requirement for optical sources by comparison with previous optoelectronic residue processors.

  18. Hybridization of biomedical circuitry

    NASA Technical Reports Server (NTRS)

    Rinard, G. A.

    1978-01-01

    The design and fabrication of low power hybrid circuits to perform vital signs monitoring are reported. The circuits consist of: (1) a clock; (2) an ECG amplifier and cardiotachometer signal conditioner; (3) an impedance pneumograph and respiration rate processor; (4) a heart/breath rate processor; (5) a temperature monitor; and (6) an LCD display.

  19. Method and apparatus for high speed data acquisition and processing

    DOEpatents

    Ferron, J.R.

    1997-02-11

    A method and apparatus are disclosed for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register. 15 figs.

  20. Method and apparatus for high speed data acquisition and processing

    DOEpatents

    Ferron, John R.

    1997-01-01

    A method and apparatus for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register.
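
    The mask-bit technique can be illustrated with a short C sketch. Assuming IEEE-754 single precision (the patent does not name a format), prestoring the exponent pattern 0x4B000000 (the value 2^23) in each storage word lets a raw 14-bit sample dropped into the mantissa form a valid float; subtracting 2^23 then recovers the sample's numeric value without an integer-to-float conversion instruction. The function name and constants below are assumptions for illustration.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Hedged sketch of mask-bit concatenation, assuming IEEE-754 floats:
         * OR a 14-bit sample into a word whose prestored exponent bits encode
         * 2^23, reinterpret the bits as a float, and subtract the 2^23 bias. */
        static float sample_to_float(uint16_t sample14)
        {
            uint32_t word = 0x4B000000u | (sample14 & 0x3FFFu); /* mask | data */
            float f;
            memcpy(&f, &word, sizeof f);   /* reinterpret the bit pattern */
            return f - 8388608.0f;         /* remove the 2^23 offset */
        }

        int main(void)
        {
            printf("%f\n", sample_to_float(12345));   /* prints 12345.000000 */
            return 0;
        }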

  1. Environmentally adaptive processing for shallow ocean applications: A sequential Bayesian approach.

    PubMed

    Candy, J V

    2015-09-01

    The shallow ocean is a changing environment primarily due to temperature variations in its upper layers directly affecting sound propagation throughout. The need to develop processors capable of tracking these changes implies a stochastic as well as an environmentally adaptive design. Bayesian techniques have evolved to enable a class of processors capable of performing in such an uncertain, nonstationary (varying statistics), non-Gaussian, variable shallow ocean environment. A solution to this problem is addressed by developing a sequential Bayesian processor capable of providing a joint solution to the modal function tracking and environmental adaptivity problem. Here, the focus is on the development of both a particle filter and an unscented Kalman filter capable of providing reasonable performance for this problem. These processors are applied to hydrophone measurements obtained from a vertical array. The adaptivity problem is attacked by allowing the modal coefficients and/or wavenumbers to be jointly estimated from the noisy measurement data along with tracking of the modal functions while simultaneously enhancing the noisy pressure-field measurements.
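
    For readers unfamiliar with the machinery, the C sketch below implements a minimal bootstrap particle filter for a scalar random-walk state observed in Gaussian noise, a toy stand-in for the paper's joint modal-tracking problem; the state model, noise levels, and particle count are invented for illustration and are not taken from the paper.

        #include <math.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Minimal bootstrap particle filter: predict with a random walk,
         * weight by a Gaussian likelihood, resample systematically. All
         * parameters are illustrative assumptions. */
        #define NP 1000                      /* number of particles */

        static double randu(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }
        static double randn(void)            /* Box-Muller transform */
        {
            return sqrt(-2.0 * log(randu())) * cos(6.283185307179586 * randu());
        }

        int main(void)
        {
            const double q = 0.05;           /* process-noise std dev */
            const double r = 0.20;           /* measurement-noise std dev */
            double x[NP], w[NP], tmp[NP];
            double truth = 1.0;

            for (int i = 0; i < NP; i++) x[i] = randn();   /* diffuse prior */

            for (int t = 0; t < 50; t++) {
                truth += q * randn();                  /* simulated environment */
                double y = truth + r * randn();        /* noisy measurement */

                double wsum = 0.0;
                for (int i = 0; i < NP; i++) {         /* predict and weight */
                    x[i] += q * randn();
                    double e = y - x[i];
                    w[i] = exp(-0.5 * e * e / (r * r));
                    wsum += w[i];
                }

                double est = 0.0;
                for (int i = 0; i < NP; i++) { w[i] /= wsum; est += w[i] * x[i]; }

                double u = randu() / NP, c = 0.0;      /* systematic resampling */
                int j = 0;
                for (int i = 0; i < NP; i++) {
                    while (c + w[j] < u && j < NP - 1) { c += w[j]; j++; }
                    tmp[i] = x[j];
                    u += 1.0 / NP;
                }
                for (int i = 0; i < NP; i++) x[i] = tmp[i];

                printf("t=%2d  truth=%.3f  estimate=%.3f\n", t, truth, est);
            }
            return 0;
        }

    An unscented Kalman filter, the paper's other processor, would replace the particle cloud with deterministically chosen sigma points, trading the particle filter's generality for lower cost.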

  2. Optical interconnection using polyimide waveguide for multichip module

    NASA Astrophysics Data System (ADS)

    Koyanagi, Mitsumasa

    1996-01-01

    We have developed a parallel processor system with 152 RISC processor chips specific for Monte-Carlo analysis. This system has the ring-bus architecture. The performance of several Gflops is expected in this system according to the computer simulation. However, it was revealed that the data transfer speed of the bus has to be increased more dramatically in order to further increase the performance. Then, we propose to introduce the optical interconnection into the parallel processor system to increase the data transfer speed of the buses. The double ring-bus architecture is employed in this new parallel processor system with optical interconnection. The free-space optical interconnection and the optical waveguide are used for the optical ring-bus. Thin polyimide film was used to form the optical waveguide. A relatively low propagation loss was achieved in the polyimide optical waveguide. In addition, it was confirmed that the propagation direction of signal light can be easily changed by using a micro-mirror.

  3. Optical interconnection using polyimide waveguide for multichip module

    NASA Astrophysics Data System (ADS)

    Koyanagi, Mitsumasa

    1996-01-01

    We have developed a parallel processor system with 152 RISC processor chips specific for Monte-Carlo analysis. This system has the ring-bus architecture. The performance of several Gflops is expected in this system according to the computer simulation. However, it was revealed that the data transfer speed of the bus has to be increased more dramatically in order to further increase the performance. Then, we propose to introduce the optical interconnection into the parallel processor system to increase the data transfer speed of the buses. The double ring-bus architecture is employed in this new parallel processor system with optical interconnection. The free-space optical interconnection and the optical waveguide are used for the optical ring-bus. Thin polyimide film was used to form the optical waveguide. A relatively low propagation loss was achieved in the polyimide optical waveguide. In addition, it was confirmed that the propagation direction of signal light can be easily changed by using a micro-mirror.

  4. Resource and Performance Evaluations of Fixed Point QRD-RLS Systolic Array through FPGA Implementation

    NASA Astrophysics Data System (ADS)

    Yokoyama, Yoshiaki; Kim, Minseok; Arai, Hiroyuki

    At present, when using space-time processing techniques with multiple antennas for mobile radio communication, real-time weight adaptation is necessary. Due to the progress of integrated circuit technology, dedicated processor implementation with ASIC or FPGA can be employed to implement various wireless applications. This paper presents a resource and performance evaluation of the QRD-RLS systolic array processor based on fixed-point CORDIC algorithm with FPGA. In this paper, to save hardware resources, we propose the shared architecture of a complex CORDIC processor. The required precision of internal calculation, the circuit area for the number of antenna elements and wordlength, and the processing speed will be evaluated. The resource estimation provides a possible processor configuration with a current FPGA on the market. Computer simulations assuming a fading channel will show a fast convergence property with a finite number of training symbols. The proposed architecture has also been implemented and its operation was verified by beamforming evaluation through a radio propagation experiment.
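
    The fixed-point CORDIC primitive at the heart of such a systolic cell can be sketched in C. In vectoring mode, CORDIC rotates a vector (x, y) onto the x-axis using only shifts and adds, yielding the magnitude (scaled by the constant CORDIC gain K ~ 1.6468) and the rotation angle, which is what a Givens-rotation cell in a QRD-RLS array needs. The Q16.16 format and 16 iterations are illustrative choices, not the paper's parameters.

        #include <stdint.h>
        #include <stdio.h>

        /* Fixed-point CORDIC in vectoring mode (Q16.16, 16 iterations,
         * both illustrative assumptions). Rotates (x, y) onto the x-axis;
         * returns K*|(x,y)| and the angle, using only shifts and adds. */
        #define ITER 16

        static const int32_t atan_tab[ITER] = {   /* atan(2^-i) in Q16.16 */
            51472, 30386, 16055, 8150, 4091, 2047, 1024, 512,
            256, 128, 64, 32, 16, 8, 4, 2
        };

        static void cordic_vector(int32_t x, int32_t y, int32_t *mag, int32_t *ang)
        {
            int32_t z = 0;
            for (int i = 0; i < ITER; i++) {
                int32_t xs = x >> i, ys = y >> i;
                if (y > 0) { x += ys; y -= xs; z += atan_tab[i]; }
                else       { x -= ys; y += xs; z -= atan_tab[i]; }
            }
            *mag = x;   /* magnitude scaled by the CORDIC gain K ~= 1.6468 */
            *ang = z;   /* angle in radians, Q16.16 */
        }

        int main(void)
        {
            int32_t mag, ang;
            cordic_vector(3 << 16, 4 << 16, &mag, &ang);   /* |(3,4)| = 5 */
            printf("magnitude = %f  angle = %f rad\n",
                   mag / 65536.0 / 1.6467602, ang / 65536.0);
            return 0;
        }

    Because every cell applies the same shift-add recurrence, the array maps naturally onto FPGA fabric; sharing one such complex CORDIC unit across cells is the resource-saving idea the paper evaluates.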

  5. Software-Reconfigurable Processors for Spacecraft

    NASA Technical Reports Server (NTRS)

    Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

    2005-01-01

    A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, error correction, coding, and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

  6. The design of an adaptive predictive coder using a single-chip digital signal processor

    NASA Astrophysics Data System (ADS)

    Randolph, M. A.

    1985-01-01

    A speech coding processor architecture design study has been performed in which the Texas Instruments TMS32010 has been selected from among three commercially available digital signal processing integrated circuits and evaluated in an implementation study of real-time Adaptive Predictive Coding (APC). The TMS32010 has been compared with the AT&T Bell Laboratories DSP I and the Nippon Electric Co. µPD7720 and was found to be most suitable for a single chip implementation of APC. A preliminary system design based on the TMS32010 has been performed, and several of the hardware and software design issues are discussed. Particular attention was paid to the design of an external memory controller which permits rapid sequential access of external RAM. As a result, it has been determined that a compact hardware implementation of the APC algorithm is feasible based on the TMS32010. Originator-supplied keywords include: vocoders, speech compression, adaptive predictive coding, digital signal processing microcomputers, speech processor architectures, and special purpose processor.

  7. Micromechanical Signal Processors

    NASA Astrophysics Data System (ADS)

    Nguyen, Clark Tu-Cuong

    Completely monolithic high-Q micromechanical signal processors constructed of polycrystalline silicon and integrated with CMOS electronics are described. The signal processors implemented include an oscillator, a bandpass filter, and a mixer + filter--all of which are components commonly required for up- and down-conversion in communication transmitters and receivers, and all of which take full advantage of the high Q of micromechanical resonators. Each signal processor is designed, fabricated, then studied with particular attention to the performance consequences associated with miniaturization of the high-Q element. The fabrication technology which realizes these components merges planar integrated circuit CMOS technologies with those of polysilicon surface micromachining. The technologies are merged in a modular fashion, where the CMOS is processed in the first module, the microstructures in a following separate module, and at no point in the process sequence are steps from each module intermixed. Although the advantages of such modularity include flexibility in accommodating new module technologies, the developed process constrained the CMOS metallization to a high temperature refractory metal (tungsten metallization with TiSi2 contact barriers) and constrained the micromachining process to long-term temperatures below 835°C. Rapid-thermal annealing (RTA) was used to relieve residual stress in the mechanical structures. To reduce the complexity involved with developing this merged process, capacitively transduced resonators are utilized. High-Q single resonator and spring-coupled micromechanical resonator filters are also investigated, with particular attention to noise performance, bandwidth control, and termination design. The noise in micromechanical filters is found to be fairly high due to poor electromechanical coupling on the micro-scale with present-day technologies. Solutions to this high series resistance problem are suggested, including smaller electrode-to-resonator gaps to increase the coupling capacitance. Active Q-control techniques are demonstrated which control the bandwidth of micromechanical filters and simulate filter terminations with little passband distortion. Noise analysis shows that these active techniques are relatively quiet when compared with other resistive techniques. Modulation techniques are investigated whereby a single resonator or a filter constructed from several such resonators can provide both a mixing and a filtering function, or a filtering and amplitude modulation function. These techniques center around the placement of a carrier signal on the micromechanical resonator. Finally, micro oven stabilization is investigated in an attempt to null the temperature coefficient of a polysilicon micromechanical resonator. Here, surface micromachining procedures are utilized to fabricate a polysilicon resonator on a microplatform--two levels of suspension--equipped with heater and temperature sensing resistors, which are then imbedded in a feedback loop to control the platform (and resonator) temperature. (Abstract shortened by UMI.)

  8. Nyquist-WDM filter shaping with a high-resolution colorless photonic spectral processor.

    PubMed

    Sinefeld, David; Ben-Ezra, Shalva; Marom, Dan M

    2013-09-01

    We employ a spatial-light-modulator-based colorless photonic spectral processor with a spectral addressability of 100 MHz along 100 GHz bandwidth, for multichannel, high-resolution reshaping of Gaussian channel response to square-like shape, compatible with Nyquist WDM requirements.

  9. The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhou, Shujia; Duffy, Daniel; Clune, Thomas

    The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratio of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~25% of total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.
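
    The "manually SIMDize four independent columns" idea can be sketched in portable C: a structure-of-arrays layout lets one loop body advance four columns in lockstep, the pattern a SIMD unit such as a Cell SPE executes one instruction per lane group. The layer-attenuation arithmetic below is an invented stand-in for the real radiative-transfer physics, and all names and sizes are illustrative.

        #include <stdio.h>

        /* Four independent atmospheric columns advanced in lockstep; the
         * inner lane loop is the manually SIMDized dimension. The physics
         * here is a fake attenuation rule, for illustration only. */
        #define NCOL 4          /* SIMD width: four columns */
        #define NLEV 72         /* vertical levels (illustrative) */

        typedef struct { float v[NCOL]; } vec4;   /* one value per column */

        static void solar_sweep(const vec4 tau[NLEV], vec4 flux[NLEV],
                                float top_flux)
        {
            vec4 t;
            for (int c = 0; c < NCOL; c++) t.v[c] = top_flux;
            for (int k = 0; k < NLEV; k++)         /* march down the levels */
                for (int c = 0; c < NCOL; c++) {   /* SIMD lane loop */
                    t.v[c] *= 1.0f - tau[k].v[c];  /* attenuate through layer */
                    flux[k].v[c] = t.v[c];
                }
        }

        int main(void)
        {
            static vec4 tau[NLEV], flux[NLEV];
            for (int k = 0; k < NLEV; k++)
                for (int c = 0; c < NCOL; c++)
                    tau[k].v[c] = 0.01f * (c + 1);
            solar_sweep(tau, flux, 1360.0f);
            printf("flux at bottom, column 0: %.2f\n", flux[NLEV - 1].v[0]);
            return 0;
        }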

  10. A high-accuracy optical linear algebra processor for finite element applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Taylor, B. K.

    1984-01-01

    Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuracy to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantifies the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.

  11. Array processor architecture

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules, whence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode, however, the processors are not lock-stepped but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

  12. Ground Terminal Processor Interface Board for Skynet Uplink Synchronization Trials

    DTIC Science & Technology

    1997-11-01

    Excerpts (title page and abstract): Ground Terminal Processor Interface Board for Skynet Uplink Synchronization Trials, by Caroline Tom, National Defence / Défense nationale ... aspects of uplink synchronization for extremely-high-frequency (EHF) spread spectrum satellite communications (SATCOM). Requirements of the GT subsystem ...

  13. Towards the formal specification of the requirements and design of a processor interface unit

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1993-01-01

    Work to formally specify the requirements and design of a Processor Interface Unit (PIU), a single-chip subsystem providing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system, is described. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. The approaches that were developed for modeling the PIU requirements and for composition of the PIU subcomponents at high levels of abstraction are described. These approaches were used to specify and verify a nontrivial subset of the PIU behavior. The PIU specification in Higher Order Logic (HOL) is documented in a companion NASA contractor report entitled 'Towards the Formal Specification of the Requirements and Design of a Processor Interface Unit - HOL Listings.' The subsequent verification approach and HOL listings are documented in NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit' and NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings.'

  14. An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors

    NASA Technical Reports Server (NTRS)

    Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

    2015-01-01

    This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.

  15. Solving very large, sparse linear systems on mesh-connected parallel computers

    NASA Technical Reports Server (NTRS)

    Opsahl, Torstein; Reif, John

    1987-01-01

    The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard, is given. Also, a detailed discussion of data mappings and performance issues is given.

  16. Safe and Efficient Support for Embedded Multi-Processors in Ada

    NASA Astrophysics Data System (ADS)

    Ruiz, Jose F.

    2010-08-01

    New software demands increasing processing power, and multi-processor platforms are spreading as the answer to achieve the required performance. Embedded real-time systems are also subject to this trend, but in the case of real-time mission-critical systems, the properties of reliability, predictability and analyzability are also paramount. The Ada 2005 language defined a subset of its tasking model, the Ravenscar profile, that provides the basis for the implementation of deterministic and time analyzable applications on top of a streamlined run-time system. This Ravenscar tasking profile, originally designed for single processors, has proven remarkably useful for modelling verifiable real-time single-processor systems. This paper proposes a simple extension to the Ravenscar profile to support multi-processor systems using a fully partitioned approach. The implementation of this scheme is simple, and it can be used to develop applications amenable to schedulability analysis.

  17. Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu

    1995-01-01

    As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independently of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
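
    The image-composing step can be sketched with MPI. The "over" operator used in front-to-back compositing is associative but not commutative, and MPI_Op_create with commute = 0 instructs MPI_Reduce to apply a user-defined operator in rank order, so ranks ordered by viewing depth (rank 0 closest) compose correctly. The one-pixel premultiplied-alpha image below is a toy, not the paper's code, and the overlap of compositing with local ray-casting is not shown.

        #include <mpi.h>
        #include <stdio.h>

        /* Front-to-back "over" compositing of per-rank RGBA samples via a
         * user-defined, non-commutative MPI reduction. Toy 1-pixel image. */
        static void over(void *in, void *inout, int *len, MPI_Datatype *dt)
        {
            float *a = in, *b = inout;            /* a is in front of b */
            (void)dt;
            for (int i = 0; i < *len; i++, a += 4, b += 4) {
                float t = 1.0f - a[3];            /* transparency of front sample */
                for (int c = 0; c < 4; c++)
                    b[c] = a[c] + t * b[c];       /* premultiplied-alpha "over" */
            }
        }

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Datatype rgba;
            MPI_Type_contiguous(4, MPI_FLOAT, &rgba);
            MPI_Type_commit(&rgba);

            MPI_Op op;
            MPI_Op_create(over, 0, &op);          /* 0: not commutative */

            float local[4] = { 0.2f * rank, 0.1f, 0.0f, 0.3f };  /* fake sample */
            float image[4];
            MPI_Reduce(local, image, 1, rgba, op, 0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("composited pixel: %.3f %.3f %.3f %.3f\n",
                       image[0], image[1], image[2], image[3]);

            MPI_Op_free(&op);
            MPI_Type_free(&rgba);
            MPI_Finalize();
            return 0;
        }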

  18. Multiprocessing on supercomputers for computational aerodynamics

    NASA Technical Reports Server (NTRS)

    Yarrow, Maurice; Mehta, Unmeel B.

    1990-01-01

    Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability of 100 MFLOPS or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can be easily extended to a larger number of processors. This code solves the three-dimensional, Reynolds averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to a generic oblique-wing aircraft problem on a four processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.

  19. System analysis of graphics processor architecture using virtual prototyping

    NASA Astrophysics Data System (ADS)

    Hancock, William R.; Groat, Jeff; Steeves, Todd; Spaanenburg, Henk; Shackleton, John

    1995-06-01

    Honeywell has been actively involved in the definition of the next generation display processors for military and commercial cockpits. A major concern is how to achieve super graphics workstation performance in avionics applications. Most notable are requirements for low volume, low power, harsh environmental conditions, real-time performance and low cost. This paper describes the application of VHDL to the system analysis tasks associated with achieving these goals in a cost effective manner. The paper will describe the top level architecture identified to provide the graphical and video processing power needed to drive future high resolution display devices and to generate more natural panoramic 3D formats. The major discussion, however, will be on the use of VHDL to model the processing elements and customized pipelines needed to realize the architecture and for doing the complex system tradeoff studies necessary to achieve a cost effective implementation. New software tools have been developed to allow 'virtual' prototyping in the VHDL environment. This results in a hardware/software codesign using VHDL performance and functional models. This unique architectural tool allows simulation and tradeoffs within a standard and tightly integrated toolset, which eventually will be used to specify and design the entire system from the top level requirements and system performance to the lowest level individual ASICs. New processing elements, algorithms, and standard graphical inputs can be designed, tested and evaluated without the costly hardware prototyping using the innovative 'virtual' prototyping techniques which are evolving on this project. In addition, virtual prototyping of the display processor does not bind the preliminary design to point solutions as a physical prototype will. When the development schedule is known, one can extrapolate processing element performance and design the system around the most current technology.

  20. Scalable Unix commands for parallel processors: a high-performance implementation.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ong, E.; Lusk, E.; Gropp, W.

    2001-06-22

    We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
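
    The flavor of these commands can be conveyed with a minimal MPI sketch, not taken from the actual implementation: every rank reports its host name and rank 0 prints the results in rank order, the same gather-and-print pattern a parallel ps or ls would use after running the real command locally.

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        /* Minimal "parallel hostname" in the spirit of the Parallel Unix
         * Commands: gather one fixed-size record per rank, print at rank 0. */
        enum { LEN = 64 };

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            char name[LEN] = "";
            gethostname(name, LEN - 1);          /* this rank's node name */

            char *all = NULL;
            if (rank == 0) all = malloc((size_t)size * LEN);
            MPI_Gather(name, LEN, MPI_CHAR, all, LEN, MPI_CHAR, 0,
                       MPI_COMM_WORLD);

            if (rank == 0) {
                for (int i = 0; i < size; i++)
                    printf("rank %3d: %s\n", i, all + (size_t)i * LEN);
                free(all);
            }
            MPI_Finalize();
            return 0;
        }

    Built with mpicc and launched with mpiexec, this prints one line per rank; a real parallel ls or ps would gather command output instead of host names.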

  1. Dataflow-Based Implementation of Layered Sensing Applications on High-Performance Embedded Processors

    DTIC Science & Technology

    2013-03-01

    Excerpts: a results table reports, per actor, time (milliseconds), GFLOPS, and comparison to GPU peak performance (%): Cascade Gaussian Filtering: 13, 45.19, 6.3; Difference of Gaussian: 0.512, 152 ... values for the GPU-targeted actor implementations in terms of Giga Floating Point Operations Per Second (GFLOPS). Our GFLOPS calculation for an actor ... kernels. The actors were implemented on an NVIDIA GTX260 GPU, which provides 715 GFLOPS as peak ...

  2. Cell-NPE (Numerical Performance Evaluation): Programming the IBM Cell Broadband Engine -- A General Parallelization Strategy

    DTIC Science & Technology

    2008-04-01

    Excerpts: ... Space GmbH as follows: B. Technical Proposal/Description of Work, "Cell: A Revolutionary High Performance Computing Platform". On 29 June 2005 [1] ... IBM has announced that it has partnered with Mercury Computer Systems, a maker of specialized computers. The Cell chip provides massive floating-point ... the computing industry away from the traditional processor technology dominated by Intel. While in the past, the development of computing power has ...

  3. Parallel processor for real-time structural control

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tise, B.L.

    1992-01-01

    A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection to host computer, parallelizing code generator, and look-up-tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An Open Windows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.

  4. Exact diagonalization of quantum lattice models on coprocessors

    NASA Astrophysics Data System (ADS)

    Siro, T.; Harju, A.

    2016-10-01

    We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
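
    A single Lanczos step, the operation timed in the paper, is dominated by one matrix-vector product plus a few vector updates, all of which parallelize directly with OpenMP. The C sketch below uses a dense symmetric matrix for brevity, whereas the paper's Hamiltonians are sparse; the matrix size and toy spectrum are illustrative assumptions.

        #include <math.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* One Lanczos step for a dense symmetric matrix A (n x n), OpenMP
         * parallel. Updates v and v_prev in place; returns the new beta.
         * Dense storage is a simplification of the paper's sparse case. */
        static double lanczos_step(int n, const double *A, double beta_prev,
                                   double *v, double *v_prev, double *w,
                                   double *alpha_out)
        {
            double alpha = 0.0, beta2 = 0.0;

            #pragma omp parallel for reduction(+:alpha)
            for (int i = 0; i < n; i++) {
                double s = 0.0;
                for (int j = 0; j < n; j++)          /* w = A v, dominant cost */
                    s += A[(size_t)i * n + j] * v[j];
                w[i] = s - beta_prev * v_prev[i];
                alpha += v[i] * w[i];
            }

            #pragma omp parallel for reduction(+:beta2)
            for (int i = 0; i < n; i++) {
                w[i] -= alpha * v[i];                /* orthogonalize against v */
                beta2 += w[i] * w[i];
            }
            double beta = sqrt(beta2);

            #pragma omp parallel for
            for (int i = 0; i < n; i++) {            /* advance the Krylov basis */
                v_prev[i] = v[i];
                v[i] = w[i] / beta;
            }
            *alpha_out = alpha;
            return beta;
        }

        int main(void)
        {
            enum { N = 512 };
            double *A = calloc((size_t)N * N, sizeof *A);
            double *v = malloc(N * sizeof *v), *vp = calloc(N, sizeof *vp);
            double *w = malloc(N * sizeof *w), alpha;

            for (int i = 0; i < N; i++) A[(size_t)i * N + i] = i + 1.0;
            for (int i = 0; i < N; i++) v[i] = 1.0 / sqrt((double)N);

            double beta = lanczos_step(N, A, 0.0, v, vp, w, &alpha);
            printf("alpha = %.4f  beta = %.4f\n", alpha, beta);

            free(A); free(v); free(vp); free(w);
            return 0;
        }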

  5. WHAT'S NEW IN MCIP2?

    EPA Science Inventory

    An update to CMAQ's Meteorology/Chemistry Interface Processor Version 2 (MCIP2) will be released in August 2004 in conjunction with the next public release of the CMAQ model. MCIP2 is the pre-processor in the CMAQ system that is typically used to perform off-line linkage between...

  6. 7 CFR 252.4 - Application to participate and agreement.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... integrity, business ethics and performance. In addition, the processors must demonstrate their ability to sell end products under NCP by submitting supporting documentation such as written intent to purchase... purchased, the processor shall invoice the recipient agency at the net case price which shall reflect the...

  7. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

  8. Effective correlator for RadioAstron project

    NASA Astrophysics Data System (ADS)

    Sergeev, Sergey

    This paper presents the implementation of a software FX-correlator for Very Long Baseline Interferometry, adapted for the "RadioAstron" project. The software correlator is implemented for heterogeneous computing systems using graphics accelerators. It is shown that graphics hardware is highly efficient for the interferometry task. The host processor of the heterogeneous computing system forms the data flow for the graphics accelerators, the number of which corresponds to the number of frequency channels; for the RadioAstron project there are seven such channels. Each accelerator computes the correlation matrix over all baselines for a single frequency channel. Initial data are converted to floating-point format, corrected with the corresponding delay function, and the entire correlation matrix is computed simultaneously. The correlation matrix is calculated using the sliding Fourier transform. Because the problem maps well onto the graphics-accelerator architecture, a single Kepler-platform processor achieves performance on this task corresponding to that of a four-node Intel computing cluster. The task scales successfully not only across a large number of graphics accelerators but also across a large number of nodes with multiple accelerators.
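
    The sliding-transform kernel is simple enough to sketch. When a length-N window advances by one sample, each frequency bin can be updated in O(1) via X_k <- (X_k + x_new - x_old) * exp(+j*2*pi*k/N), rather than recomputing an N-point transform; a correlator applies such an update per bin. The window length, bin index, and test tone in the C sketch below are illustrative, not the correlator's actual parameters.

        #include <complex.h>
        #include <math.h>
        #include <stdio.h>

        /* Sliding DFT: O(1) per-bin update as the analysis window advances
         * one sample. N, the bin index, and the signal are illustrative. */
        #define N 64

        int main(void)
        {
            const double PI = acos(-1.0);
            double x[2 * N];
            for (int n = 0; n < 2 * N; n++)
                x[n] = cos(2.0 * PI * 5.0 * n / N);     /* tone in bin 5 */

            const int k = 5;
            const double complex rot = cexp(2.0 * PI * I * k / N);

            double complex X = 0.0;                     /* DFT of first window */
            for (int n = 0; n < N; n++)
                X += x[n] * cexp(-2.0 * PI * I * k * n / N);

            for (int n = 0; n < N; n++) {               /* slide one sample at a time */
                X = (X + x[n + N] - x[n]) * rot;        /* O(1) bin update */
                if (n % 16 == 0)
                    printf("slide %2d  |X_5| = %.3f\n", n, cabs(X));
            }
            return 0;
        }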

  9. Design of a real-time wind turbine simulator using a custom parallel architecture

    NASA Technical Reports Server (NTRS)

    Hoffman, John A.; Gluck, R.; Sridhar, S.

    1995-01-01

    The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named: the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is very much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CU's) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CU's are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU causes several tasks to be done in each cycle, including an I/O operation and the combination of a multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CU's are interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/output requirements. CU's can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors which usually have a throughput limit because of rigid bus architecture.

  10. Design of object-oriented distributed simulation classes

    NASA Technical Reports Server (NTRS)

    Schoeffler, James D. (Principal Investigator)

    1995-01-01

    Distributed simulation of aircraft engines as part of a computer aided design package is being developed by NASA Lewis Research Center for the aircraft industry. The project is called NPSS, an acronym for 'Numerical Propulsion Simulation System'. NPSS is a flexible object-oriented simulation of aircraft engines requiring high computing speed. It is desirable to run the simulation on a distributed computer system with multiple processors executing portions of the simulation in parallel. The purpose of this research was to investigate object-oriented structures such that individual objects could be distributed. The set of classes used in the simulation must be designed to facilitate parallel computation. Since the portions of the simulation carried out in parallel are not independent of one another, there is the need for communication among the parallel executing processors which in turn implies need for their synchronization. Communication and synchronization can lead to decreased throughput as parallel processors wait for data or synchronization signals from other processors. As a result of this research, the following have been accomplished. The design and implementation of a set of simulation classes which result in a distributed simulation control program have been completed. The design is based upon MIT 'Actor' model of a concurrent object and uses 'connectors' to structure dynamic connections between simulation components. Connectors may be dynamically created according to the distribution of objects among machines at execution time without any programming changes. Measurements of the basic performance have been carried out with the result that communication overhead of the distributed design is swamped by the computation time of modules unless modules have very short execution times per iteration or time step. An analytical performance model based upon queuing network theory has been designed and implemented. Its application to realistic configurations has not been carried out.

  11. Design of Object-Oriented Distributed Simulation Classes

    NASA Technical Reports Server (NTRS)

    Schoeffler, James D.

    1995-01-01

    Distributed simulation of aircraft engines as part of a computer aided design package being developed by NASA Lewis Research Center for the aircraft industry. The project is called NPSS, an acronym for "Numerical Propulsion Simulation System". NPSS is a flexible object-oriented simulation of aircraft engines requiring high computing speed. It is desirable to run the simulation on a distributed computer system with multiple processors executing portions of the simulation in parallel. The purpose of this research was to investigate object-oriented structures such that individual objects could be distributed. The set of classes used in the simulation must be designed to facilitate parallel computation. Since the portions of the simulation carried out in parallel are not independent of one another, there is the need for communication among the parallel executing processors which in turn implies need for their synchronization. Communication and synchronization can lead to decreased throughput as parallel processors wait for data or synchronization signals from other processors. As a result of this research, the following have been accomplished. The design and implementation of a set of simulation classes which result in a distributed simulation control program have been completed. The design is based upon MIT "Actor" model of a concurrent object and uses "connectors" to structure dynamic connections between simulation components. Connectors may be dynamically created according to the distribution of objects among machines at execution time without any programming changes. Measurements of the basic performance have been carried out with the result that communication overhead of the distributed design is swamped by the computation time of modules unless modules have very short execution times per iteration or time step. An analytical performance model based upon queuing network theory has been designed and implemented. Its application to realistic configurations has not been carried out.

  12. Design and evaluation of an architecture for a digital signal processor for instrumentation applications

    NASA Astrophysics Data System (ADS)

    Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos

    1990-03-01

    The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.

  13. On-board fault diagnostics for fly-by-light flight control systems using neural network flight processors

    NASA Astrophysics Data System (ADS)

    Urnes, James M., Sr.; Cushing, John; Bond, William E.; Nunes, Steve

    1996-10-01

    Fly-by-Light control systems offer higher performance for fighter and transport aircraft, with efficient fiber optic data transmission, electric control surface actuation, and multi-channel high capacity centralized processing combining to provide maximum aircraft flight control system handling qualities and safety. The key to efficient support for these vehicles is timely and accurate fault diagnostics of all control system components. These diagnostic tests are best conducted during flight when all facts relating to the failure are present. The resulting data can be used by the ground crew for efficient repair and turnaround of the aircraft, saving time and money in support costs. These difficult-to-diagnose (Cannot Duplicate) fault indications account for 40-50% of maintenance activities on today's fighter and transport aircraft, adding significantly to fleet support cost. Fiber optic data transmission can support a wealth of data for fault monitoring; the most efficient method of fault diagnostics is accurate modeling of the component response under normal and failed conditions for use in comparison with the actual component flight data. Neural Network hardware processors offer an efficient and cost-effective method to install fault diagnostics in flight systems, permitting on-board diagnostic modeling of very complex subsystems. Task 2C of the ARPA FLASH program is a design demonstration of this diagnostics approach, using the very high speed computation of the Adaptive Solutions Neural Network processor to monitor an advanced Electrohydrostatic control surface actuator linked through an AS-1773A fiber optic bus. This paper describes the design approach and projected performance of this on-line diagnostics system.

  14. Investigation of a 2-step agglomeration process performed in a rotary processor using polyethylene glycol solutions as the primary binder liquid.

    PubMed

    Kristensen, Jakob

    2006-10-27

    The purpose of this research was to investigate the use of polyethylene glycol (PEG) solutions as the primary binder liquid in a 2-step agglomeration process performed in a rotary processor and characterize the resulting granules and their tableting characteristics. This was done by granulation of binary mixtures of microcrystalline cellulose (MCC) and either lactose, calcium phosphate, acetaminophen, or theophylline, in a 1:3 ratio, using a 50% (wt/wt) aqueous solution of PEG and water as the binder liquid. Formulations containing lactose were agglomerated using 5 different amounts of the PEG binder solution, giving rise to a PEG content in the range of 6% to 43% (wt/wt). The process outcome was characterized according to adhesion, yield, and water requirement, and the prepared granules were characterized according to size, size distribution, and flow properties as well as tableting properties. The agglomeration of all mixtures resulted in high yields of free-flowing agglomerates and gave rise to good reproducibility of the investigated agglomerate characteristics. The process allowed for the incorporation of 42.5% (wt/wt) PEG, which is higher than the percentage of PEG reported for other equipment. Tablets of sufficient strength could be prepared with all investigated excipients using 20% wt/wt PEG; higher PEG contents gave rise to adhesion and prolonged disintegration. In conclusion, agglomeration in a torque-controlled rotary processor using solutions of PEG as the primary binder liquid was found to be a robust process, suitable for the incorporation of high contents of PEG and/or drug compounds.

  15. Design, Implementation, and Evaluation of a Virtual Shared Memory System in a Multi-Transputer Network.

    DTIC Science & Technology

    1987-12-01

    Excerpts (table of contents and abstract): contents include "Synchronization and Data Passing Mechanism" and "System Shut Down" ... high performance, fault tolerance, and extensibility. These features are attained by synchronizing and coordinating the distributed multicomputer ... synchronizing all processors in the network. In a multitransputer network, processes that communicate with each other do so synchronously. This makes ...

  16. Design and Demonstration of RSFQ Processor Datapath for High Performance Computing

    DTIC Science & Technology

    2014-09-30

    Excerpts (report documentation): partner organizations include InQubit, Inc. (Torrance, CA) and the University of Illinois at Urbana-Champaign (SF memory development).

  17. High Performance Active Database Management on a Shared-Nothing Parallel Processor

    DTIC Science & Technology

    1998-05-01

    Excerpts: ... either stored or virtual. A stored node is like a materialized view; it actually contains the specified tuples. A virtual node is like a real view ...

  18. AFOSR BRI: Co-Design of Hardware/Software for Predicting MAV Aerodynamics

    DTIC Science & Technology

    2016-09-27

    Excerpts: While Moore's Law theoretically doubles processor performance every 24 months, much of the realizable performance remains ... past efforts to develop such CFD codes on accelerated processors showed limited success; our hardware/software co-design approach created malleable ...

  19. Modeling Large Scale Circuits Using Massively Parallel Discrete-Event Simulation

    DTIC Science & Technology

    2013-06-01

    Excerpts: As supercomputer systems grow to exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g., its power consumption) ... As supercomputer systems approach exascale, the core count will exceed 1024 and the number of transistors used in ...

  20. Implementation of an ADI method on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.

  1. Implementation of an ADI method on parallel computers

    NASA Technical Reports Server (NTRS)

    Fatoohi, Raad A.; Grosch, Chester E.

    1987-01-01

    In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
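
    The tridiagonal kernel at the center of these ADI sweeps is the Thomas variant of Gaussian elimination, sketched in C below (the MPP instead used cyclic elimination, which exposes fine-grained parallelism at the cost of extra arithmetic). The diffusion-like system in main is an invented example, not one from the paper.

        #include <stdio.h>

        /* Thomas algorithm: Gaussian elimination specialized to a tridiagonal
         * system. a = sub-, b = main, c = super-diagonal; d = right-hand side
         * on entry, solution on exit. b and d are overwritten. */
        static void thomas(int n, const double *a, double *b, const double *c,
                           double *d)
        {
            for (int i = 1; i < n; i++) {        /* forward elimination */
                double m = a[i] / b[i - 1];
                b[i] -= m * c[i - 1];
                d[i] -= m * d[i - 1];
            }
            d[n - 1] /= b[n - 1];                /* back substitution */
            for (int i = n - 2; i >= 0; i--)
                d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
        }

        int main(void)
        {
            /* toy diffusion-like system: -u[i-1] + 4 u[i] - u[i+1] = 2 */
            enum { M = 8 };
            double a[M], b[M], c[M], d[M];
            for (int i = 0; i < M; i++) { a[i] = -1; b[i] = 4; c[i] = -1; d[i] = 2; }
            thomas(M, a, b, c, d);
            for (int i = 0; i < M; i++) printf("u[%d] = %.4f\n", i, d[i]);
            return 0;
        }

    The forward sweep's loop-carried dependence is what makes this kernel serial per system, which is why ADI codes parallelize across the many independent tridiagonal systems instead.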

  2. Apparatus for and method of testing an electrical ground fault circuit interrupt device

    DOEpatents

    Andrews, L.B.

    1998-08-18

    An apparatus for testing a ground fault circuit interrupt device includes a processor, an input device connected to the processor for receiving input from an operator, a storage media connected to the processor for storing test data, an output device connected to the processor for outputting information corresponding to the test data to the operator, and a calibrated variable load circuit connected between the processor and the ground fault circuit interrupt device. The ground fault circuit interrupt device is configured to trip a corresponding circuit breaker. The processor is configured to receive signals from the calibrated variable load circuit and to process the signals to determine a trip threshold current and/or a trip time. A method of testing the ground fault circuit interrupt device includes a first step of providing an identification for the ground fault circuit interrupt device. Test data is then recorded in accordance with the identification. By comparing test data from an initial test with test data from a subsequent test, a trend of performance for the ground fault circuit interrupt device is determined. 17 figs.

  3. Apparatus for and method of testing an electrical ground fault circuit interrupt device

    DOEpatents

    Andrews, Lowell B.

    1998-01-01

    An apparatus for testing a ground fault circuit interrupt device includes a processor, an input device connected to the processor for receiving input from an operator, a storage media connected to the processor for storing test data, an output device connected to the processor for outputting information corresponding to the test data to the operator, and a calibrated variable load circuit connected between the processor and the ground fault circuit interrupt device. The ground fault circuit interrupt device is configured to trip a corresponding circuit breaker. The processor is configured to receive signals from the calibrated variable load circuit and to process the signals to determine a trip threshold current and/or a trip time. A method of testing the ground fault circuit interrupt device includes a first step of providing an identification for the ground fault circuit interrupt device. Test data is then recorded in accordance with the identification. By comparing test data from an initial test with test data from a subsequent test, a trend of performance for the ground fault circuit interrupt device is determined.

  4. Stream Processors

    NASA Astrophysics Data System (ADS)

    Erez, Mattan; Dally, William J.

    Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
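
    The gather-compute-scatter form can be illustrated with a scalar C sketch, where an explicit local block stands in for the software-managed on-chip store; the block size, kernel, and names are illustrative rather than drawn from any particular stream processor.

        #include <stdio.h>

        /* Gather-compute-scatter skeleton: data are gathered from "off-chip"
         * arrays into a local block (modeling the software-managed on-chip
         * store), a kernel runs on the block, and results are scattered back. */
        #define STREAM_LEN 1024
        #define BLOCK      64           /* models the on-chip store capacity */

        static void kernel(float *block, int n)          /* compute phase */
        {
            for (int i = 0; i < n; i++)
                block[i] = block[i] * block[i] + 1.0f;
        }

        int main(void)
        {
            static float in[STREAM_LEN], out[STREAM_LEN], local[BLOCK];
            for (int i = 0; i < STREAM_LEN; i++) in[i] = (float)i;

            for (int base = 0; base < STREAM_LEN; base += BLOCK) {
                for (int i = 0; i < BLOCK; i++) local[i] = in[base + i]; /* gather */
                kernel(local, BLOCK);                                    /* compute */
                for (int i = 0; i < BLOCK; i++) out[base + i] = local[i];/* scatter */
            }
            printf("out[10] = %.1f\n", out[10]);   /* prints 101.0 */
            return 0;
        }

    On a real stream processor the gather and scatter phases are bulk DMA-style transfers that the software schedules to overlap with the compute phase of the neighboring block.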

  5. System support software for the Space Ultrareliable Modular Computer (SUMC)

    NASA Technical Reports Server (NTRS)

    Hill, T. E.; Hintze, G. C.; Hodges, B. C.; Austin, F. A.; Buckles, B. P.; Curran, R. T.; Lackey, J. D.; Payne, R. E.

    1974-01-01

    The highly transportable programming system designed and implemented to support the development of software for the Space Ultrareliable Modular Computer (SUMC) is described. The SUMC system support software consists of program modules called processors. The initial set of processors consists of the supervisor, the general purpose assembler for SUMC instruction and microcode input, linkage editors, an instruction level simulator, a microcode grid print processor, and user oriented utility programs. A FORTRAN 4 compiler is undergoing development. The design facilitates the addition of new processors with minimum effort and provides the user quasi-host independence on the ground-based operational software development computer. Additional capability is provided to accommodate variations in the SUMC architecture without consequent major modifications to the initial processors.

  6. Towards energy-efficient photonic interconnects

    NASA Astrophysics Data System (ADS)

    Demir, Yigit; Hardavellas, Nikos

    2015-03-01

    Silicon photonics have emerged as a promising solution to meet the growing demand for high-bandwidth, low-latency, and energy-efficient on-chip and off-chip communication in many-core processors. However, current silicon-photonic interconnect designs for many-core processors waste a significant amount of power because (a) lasers are always on, even during periods of interconnect inactivity, and (b) microring resonators employ heaters which consume a significant amount of power just to overcome thermal variations and maintain communication on the photonic links, especially in a 3D-stacked design. The problem of high laser power consumption is particularly important as lasers typically have very low energy efficiency, and photonic interconnects often remain underutilized both in scientific computing (compute-intensive execution phases underutilize the interconnect) and in server computing (servers in Google-scale datacenters have a typical utilization of less than 30%). We address the high laser power consumption by proposing EcoLaser+, a laser control scheme that saves energy by predicting the interconnect activity and opportunistically turning the on-chip laser off when possible, and also by scaling the width of the communication link based on a runtime prediction of the expected message length. Our laser control scheme can save up to 62-92% of the laser energy and improve the energy efficiency of a many-core processor with negligible performance penalty. We address the high trimming (heating) power consumption of the microrings by proposing insulation methods that reduce the impact of localized heating induced by highly-active components on the 3D-stacked logic die.
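
    The laser-gating idea can be sketched as a simple history-based predictor. The history length and threshold below are invented for illustration; the paper's actual predictor and its link-width scaling are more elaborate:

        from collections import deque

        # Keep the laser off unless enough recent cycles carried traffic.
        class LaserGate:
            def __init__(self, history=8, threshold=2):
                self.recent = deque(maxlen=history)
                self.threshold = threshold

            def step(self, had_traffic):
                self.recent.append(1 if had_traffic else 0)
                return sum(self.recent) >= self.threshold  # True = laser on

        gate = LaserGate()
        print([gate.step(t) for t in [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]])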

  7. VASP-4096: a very high performance programmable device for digital media processing applications

    NASA Astrophysics Data System (ADS)

    Krikelis, Argy

    2001-03-01

    Over the past few years, technology drivers for microprocessors have changed significantly. Media data delivery and processing--such as telecommunications, networking, video processing, speech recognition and 3D graphics--is increasing in importance and will soon dominate the processing cycles consumed in computer-based systems. This paper presents the architecture of the VASP-4096 processor. VASP-4096 provides high media performance with low energy consumption by integrating associative SIMD parallel processing with embedded microprocessor technology. The major innovation in the VASP-4096 is the integration of thousands of processing units on a single chip, capable of supporting software-programmable high-performance mathematical functions as well as abstract data processing. In addition to 4096 processing units, VASP-4096 integrates on a single chip a RISC controller that is an implementation of the SPARC architecture, 128 Kbytes of Data Memory, and I/O interfaces. The SIMD processing in VASP-4096 implements the ASProCore architecture, a proprietary implementation of SIMD processing that operates at 266 MHz with program instructions issued by the RISC controller. The device also integrates a 64-bit synchronous main memory interface operating at 133 MHz (double data rate), and a 64-bit 66 MHz PCI interface. VASP-4096, compared with other processor architectures that support media processing, offers true performance scalability, support for deterministic and non-deterministic data processing on a single device, and software programmability that can be reused in future chip generations.

  8. Electroacoustic dewatering of food and other suspensions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kim, B.C.; Zelinski, M.S.; Criner, C.L.

    1989-05-31

    The food processing industry is a large user of energy for evaporative drying due to the limited effectiveness of conventional mechanical dewatering machines. Battelle's Electroacoustic Dewatering (EAD) process improves the performance of mechanical dewatering machines by superimposing electric and ultrasonic fields. A two-phase development program to demonstrate the benefits of EAD was carried out in cooperation with the food processing industry, the National Food Processors Association (NFPA) and two equipment vendors. In Phase I, laboratory scale studies were carried out on a variety of food suspensions. The process was scaled up to small commercial scale in Phase II. The technical feasibility of EAD for a variety of food materials, without adversely affecting the food properties, was successfully demonstrated during this phase, which is the subject of this report. Two Process Research Units (PRUs) were designed and built through joint efforts between Battelle and two equipment vendors. A 0.5-meter wide belt press was tested on apple mash, corn fiber, and corn gluten at sites provided by two food processors. A high speed citrus juice finisher (a hybrid form of screw press and centrifuge) was tested on orange pulp. These tests were carried out jointly by Battelle, equipment vendors, NFPA, and food processors. The apple and citrus juice products were analyzed by food processors and NFPA. 26 figs., 30 tabs.

  9. Microlens array processor with programmable weight mask and direct optical input

    NASA Astrophysics Data System (ADS)

    Schmid, Volker R.; Lueder, Ernst H.; Bader, Gerhard; Maier, Gert; Siegordner, Jochen

    1999-03-01

    We present an optical feature extraction system with a microlens array processor. The system is suitable for online implementation of a variety of transforms such as the Walsh transform and the DCT. Operating with incoherent light, our processor accepts direct optical input. Employing a sandwich-like architecture, we obtain a very compact design of the optical system. The key elements of the microlens array processor are a square array of 15 x 15 spherical microlenses on an acrylic substrate and a spatial light modulator as a transmissive mask. The light distribution behind the mask is imaged onto the pixels of a customized a-Si image sensor with adjustable gain. We obtain one output sample for each microlens image and its corresponding weight mask area as a summation of the transmitted intensity within one sensor pixel. The resulting architecture is very compact and robust, like a conventional camera lens, while incorporating a high degree of parallelism. We successfully demonstrate a Walsh transform into the spatial frequency domain as well as the implementation of a discrete cosine transform with digitized gray values. We provide results showing the transformation performance for both synthetic image patterns and images of natural texture samples. The extracted frequency features are suitable for neural classification of the input image. Other transforms and correlations can be implemented in real time, allowing adaptive optical signal processing.
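
    Functionally, each output sample is the mask-weighted sum of one microlens subimage, which can be emulated as below. NumPy stands in for the optics; the array sizes and the +/-1 mask values are illustrative only (a real incoherent-light mask encodes weights as transmittances, so signed values need an offset or differential scheme):

        import numpy as np

        # One output per lens: sum of its subimage weighted by its mask patch.
        def microlens_transform(patches, masks):
            return (patches * masks).sum(axis=(1, 2))

        rng = np.random.default_rng(0)
        patches = rng.random((4, 8, 8))    # 4 microlens images, 8x8 each
        masks = np.ones((4, 8, 8))
        masks[1, :, 4:] = -1               # a crude Walsh-like basis row
        print(microlens_transform(patches, masks))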

  10. ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lidman, J; Quinlan, D; Liao, C

    2012-03-26

    Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present an open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).
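
    One classic transformation in this space is triple modular redundancy (TMR) with majority voting. The sketch below uses Python for brevity (the ROSE-based framework itself rewrites C/C++ source), and TMR is offered as an illustrative pattern, not necessarily the paper's specific transformation:

        # Execute a computation three times and vote on the result.
        def tmr(fn, *args):
            results = [fn(*args) for _ in range(3)]
            for r in results:
                if results.count(r) >= 2:
                    return r                  # majority value wins
            raise RuntimeError("no majority: uncorrectable transient fault")

        print(tmr(lambda x: x * x, 7))  # 49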

  11. Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Tumeo, Antonino; Secchi, Simone

    Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, we introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10% accuracy. Emulation is only 25 to 200 times slower than real time.

  12. 50 CFR 679.52 - Observer provider permitting and responsibilities.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... samples from the observer's deployment by the completion of the electronic vessel and/or processor survey(s); (ii) Complete NMFS electronic vessel and/or processor surveys before performing other jobs or... or experimental fishing as defined in § 600.10 of this chapter. (2) Must not have a direct financial...

  13. 50 CFR 679.52 - Observer provider permitting and responsibilities.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... samples from the observer's deployment by the completion of the electronic vessel and/or processor survey(s); (ii) Complete NMFS electronic vessel and/or processor surveys before performing other jobs or... or experimental fishing as defined in § 600.10 of this chapter. (2) Must not have a direct financial...

  14. Parallel eigenanalysis of finite element models in a completely connected architecture

    NASA Technical Reports Server (NTRS)

    Akl, F. A.; Morel, M. R.

    1989-01-01

    A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(Φ) = (M)(Φ)(Ω), where (K) and (M) are of order N, and (Ω) is of order q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, the Cray X-MP. A finite element model is divided into m domains, each of which is assumed to contain n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effects of the number of domains, the number of degrees of freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in a parallel environment is analyzed.
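
    The speed-up and efficiency measures used to evaluate such an algorithm follow the standard definitions S = T1/Tp and E = S/p; a worked example with invented timings:

        # Standard parallel performance metrics (timings are illustrative).
        def speedup(t1, tp):
            return t1 / tp

        def efficiency(t1, tp, p):
            return speedup(t1, tp) / p

        t1, tp, p = 120.0, 35.0, 4   # serial time, parallel time, processors
        print(f"S = {speedup(t1, tp):.2f}, E = {efficiency(t1, tp, p):.1%}")
        # S = 3.43, E = 85.7%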

  15. Power processor for a 30cm ion thruster

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.

    1974-01-01

    A thermal vacuum power processor for the NASA Lewis 30cm Mercury Ion Engine was designed, fabricated and tested to determine compliance with electrical specifications. The power processor breadboard used the silicon controlled rectifier (SCR) series resonant inverter as the basic power stage to process all the power to an ion engine. The power processor includes a digital interface unit to process all input commands and internal telemetry signals so that operation is compatible with a central computer system. The breadboard was tested in a thermal vacuum environment. Integration tests were performed with the ion engine and demonstrated operational compatibility and reliable operation without any component failures. Electromagnetic interference data were also recorded on the design to provide information on the interaction with the total spacecraft.

  16. Stencils and problem partitionings: Their influence on the performance of multiple processor systems

    NASA Technical Reports Server (NTRS)

    Reed, D. A.; Adams, L. M.; Patrick, M. L.

    1986-01-01

    Given a discretization stencil, partitioning the problem domain is an important first step for the efficient solution of partial differential equations on multiple processor systems. Partitions are derived that minimize interprocessor communication when the number of processors is known a priori and each domain partition is assigned to a different processor. This partitioning technique uses the stencil structure to select appropriate partition shapes. For square problem domains, it is shown that non-standard partitions (e.g., hexagons) are frequently preferable to the standard square partitions for a variety of commonly used stencils. This investigation is concluded with a formalization of the relationship between partition shape, stencil structure, and architecture, allowing selection of optimal partitions for a variety of parallel systems.
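
    The underlying trade-off is that computation scales with a partition's area while communication scales with its perimeter (for a 5-point stencil, roughly one value per boundary grid point per exchange); a rough model with invented numbers:

        import math

        # Per-processor work ~ area; per-processor communication ~ perimeter.
        def square_partition_comm(n_points, p):
            side = math.sqrt(n_points / p)   # side of one square block
            return 4 * side                  # boundary points exchanged

        n, p = 1_000_000, 100
        print(f"work/proc = {n // p}, comm/proc ~ {square_partition_comm(n, p):.0f}")

    Hexagons improve this ratio because, for a given area, a hexagon has a shorter boundary than a square, which is one intuition for why non-standard partitions can be preferable for some stencils.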

  17. Development for SSV on a parallel processing system (PARAGON)

    NASA Astrophysics Data System (ADS)

    Gothard, Benny M.; Allmen, Mark; Carroll, Michael J.; Rich, Dan

    1995-12-01

    A goal of the surrogate semi-autonomous vehicle (SSV) program is to have multiple vehicles navigate autonomously and cooperatively with other vehicles. This paper describes the process and tools used in porting UGV/SSV (unmanned ground vehicle) autonomous mobility and target recognition algorithms from a SISD (single instruction single data) processor architecture (i.e., a Sun SPARC workstation running C/UNIX) to a MIMD (multiple instruction multiple data) parallel processor architecture (i.e., PARAGON, a parallel set of i860 processors running C/UNIX). It discusses the gains in performance and the pitfalls of such a venture. It also examines the merits of this processor architecture (based on this conceptual prototyping effort) and programming paradigm to meet the final SSV demonstration requirements.

  18. 76 FR 65579 - Certain High Production Volume Chemicals; Test Rule and Significant New Use Rule; Fourth Group of...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-10-21

    ...) section 4(a)(1)(B) to require manufacturers and processors of 23 high production volume (HPV) chemical... manufacturing and petroleum refineries. Processors of one or more of the subject chemical substances (NAICS... by this action. Other types of...

  19. A parallel algorithm for 2D visco-acoustic frequency-domain full-waveform inversion: application to a dense OBS data set

    NASA Astrophysics Data System (ADS)

    Sourbier, F.; Operto, S.; Virieux, J.

    2006-12-01

    We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in Fortran 90 and uses MPI for parallelism. The algorithm was applied to a real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency domain, solving the wave equation requires the resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computers to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided into three main steps. First, a symbolic analysis step performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization, and estimates the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accommodate numerical pivoting, and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, two simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute the gradient of the cost function in parallel. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communication. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency, before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non-redundant shot and receiver position. The same strategy as the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 to 15 Hz were inverted. Twenty iterations per frequency were computed, leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km, corresponding to a finite-difference grid of 4201 x 1001 points with a 25-m grid interval. The number of shots was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on six 32-bit bi-processor nodes with 4 Gbytes of RAM per node when only the LU factorization is performed in parallel. Preliminary estimates of the time required to perform the inversion with the fully parallelized code are 6 and 4 days using 20 and 50 processors, respectively.
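
    The final centralization step maps naturally onto a gather collective. Below is a sketch with mpi4py standing in for the paper's Fortran 90/MPI code; the equal-sized sub-domains and zero-initialized buffers are illustrative assumptions:

        from mpi4py import MPI
        import numpy as np

        comm = MPI.COMM_WORLD
        # Each rank holds the partial gradient for its own sub-domain.
        local = np.zeros(4201 * 1001 // comm.size)
        # ... weighted stack of shot and residual wavefield solutions ...
        full = np.empty(local.size * comm.size) if comm.rank == 0 else None
        comm.Gather(local, full, root=0)   # centralize on the master rank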

  20. Status report of the end-to-end ASKAP software system: towards early science operations

    NASA Astrophysics Data System (ADS)

    Guzman, Juan Carlos; Chapman, Jessica; Marquarding, Malte; Whiting, Matthew

    2016-08-01

    The Australian SKA Pathfinder (ASKAP) is a novel centimetre radio synthesis telescope currently in the commissioning phase and located in the midwest region of Western Australia. It comprises 36 12-m-diameter reflector antennas, each equipped with state-of-the-art, award-winning Phased Array Feed (PAF) technology. The PAFs provide a wide, 30-square-degree field of view by forming up to 36 separate dual-polarisation beams at once. This results in a high data rate: 70 TB of correlated visibilities in an 8-hour observation, requiring custom-written, high-performance software running in dedicated High Performance Computing (HPC) facilities. The first six antennas, equipped with first-generation PAF technology (Mark I) and named the Boolardy Engineering Test Array (BETA), have been in use since 2014 as a platform to test PAF calibration and imaging techniques, and along the way they have produced some great science results. Commissioning of ASKAP Array Release 1, that is, the first six antennas with second-generation PAFs (Mark II), is currently under way. An integral part of the instrument is the Central Processor platform hosted at the Pawsey Supercomputing Centre in Perth, which executes custom-written software pipelines designed specifically to meet the ASKAP imaging requirements of wide field of view and high dynamic range. There are three key hardware components of the Central Processor: the ingest nodes (a 16-node cluster), the fast temporary storage (a 1 PB Lustre file system) and the processing supercomputer (a 200 TFlop system). This HPC platform is managed and supported by the Pawsey support team. Due to the limited amount of data generated by BETA and the first ASKAP Array Release, the Central Processor platform has been running in a more "traditional", user-interactive mode. But this is about to change: integration and verification of the online ingest pipeline, which is required to support the full 300 MHz bandwidth for Array Release 1, starts in early 2016, followed by the deployment of the real-time data processing components. In addition to the Central Processor, the first production release of the CSIRO ASKAP Science Data Archive (CASDA) has also been deployed in one of the Pawsey Supercomputing Centre facilities and is integrated into the end-to-end ASKAP data flow system. This paper describes the current status of the "end-to-end" data flow software system, from preparing observations to data acquisition, processing and archiving, and the challenges of integrating an HPC facility as a key part of the instrument. It also shares some lessons learned since the start of integration activities and the challenges ahead in preparation for the start of the Early Science program.
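
    For scale, the quoted figure of 70 TB per 8-hour observation works out to the sustained ingest rate below (decimal units assumed):

        # 70 TB per 8-hour observation -> sustained ingest rate.
        volume_bytes = 70e12
        seconds = 8 * 3600
        print(f"{volume_bytes / seconds / 1e9:.2f} GB/s")   # ~2.43 GB/s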
