single processor computer: Topics by Science.gov

Sample records for single processor computer

Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processor involved are multicore processor which has four cores in it, thus made this cluster to have eight processors. Here, the cluster incorporate Ubuntu 14.04 LINUX environment with MPI implementation (MPICH2). Two main tests were conducted in order to test the cluster, which is communication test and performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem and were done by using simple MPI Hello Program where the program written in C language. Additional, performance test was also done to prove that this cluster calculation performance is much better than single CPU computer. In this performance test, four tests were done by running the same code by using single node, 2 processors, 4 processors, and 8 processors. The result shows that with additional processors, the time required to solve the problem decrease. Time required for the calculation shorten to half when we double the processors. To conclude, we successfully develop a small scale cluster computer using common hardware which capable of higher computing power when compare to single CPU processor, and this can be beneficial for research that require high computing power especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
On nonlinear finite element analysis in single-, multi- and parallel-processors

NASA Technical Reports Server (NTRS)

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
Gigaflop performance on a CRAY-2: Multitasking a computational fluid dynamics application

NASA Technical Reports Server (NTRS)

Tennille, Geoffrey M.; Overman, Andrea L.; Lambiotte, Jules J.; Streett, Craig L.

1991-01-01

The methodology is described for converting a large, long-running applications code that executed on a single processor of a CRAY-2 supercomputer to a version that executed efficiently on multiple processors. Although the conversion of every application is different, a discussion of the types of modification used to achieve gigaflop performance is included to assist others in the parallelization of applications for CRAY computers, especially those that were developed for other computers. An existing application, from the discipline of computational fluid dynamics, that had utilized over 2000 hrs of CPU time on CRAY-2 during the previous year was chosen as a test case to study the effectiveness of multitasking on a CRAY-2. The nature of dominant calculations within the application indicated that a sustained computational rate of 1 billion floating-point operations per second, or 1 gigaflop, might be achieved. The code was first analyzed and modified for optimal performance on a single processor in a batch environment. After optimal performance on a single CPU was achieved, the code was modified to use multiple processors in a dedicated environment. The results of these two efforts were merged into a single code that had a sustained computational rate of over 1 gigaflop on a CRAY-2. Timings and analysis of performance are given for both single- and multiple-processor runs.
Multiple Embedded Processors for Fault-Tolerant Computing

NASA Technical Reports Server (NTRS)

Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

2005-01-01

A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
Multi-Core Processor Memory Contention Benchmark Analysis Case Study

NASA Technical Reports Server (NTRS)

Simon, Tyler; McGalliard, James

2009-01-01

Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
Multi-processing on supercomputers for computational aerodynamics

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Mehta, Unmeel B.

1990-01-01

The MIMD concept is applied, through multitasking, with relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. An existing single processor algorithm is mapped without the need for developing a new algorithm. The procedure of designing a code utilizing this approach is automated with the Unix stream editor. A Multiple Processor Multiple Grid (MPMG) code is developed as a demonstration of this approach. This code solves the three-dimensional, Reynolds-averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. This solver is applied to a generic, oblique-wing aircraft problem on a four-processor computer using one process for data management and nonparallel computations and three processes for pseudotime advance on three different grid systems.
Multiprocessing on supercomputers for computational aerodynamics

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Mehta, Unmeel B.

1990-01-01

Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPs or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can be easily extended to a larger number of processors. This code solves the three-dimensional, Reynolds averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to generic oblique-wing aircraft problem on a four processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.
Evaluating local indirect addressing in SIMD proc essors

NASA Technical Reports Server (NTRS)

Middleton, David; Tomboulian, Sherryl

1989-01-01

In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The available hardware resources are devoted to creating the largest possible number of processors, and consequently each individual processor must use the fewest possible resources. Disagreement exists as to whether SIMD processors should be able to generate addresses individually into their local data memory, or all processors should access the same address. The tradeoff is examined between the increased capability and the reduced number of processors that occurs in this single instruction stream, multiple, locally addressed, data (SIMLAD) model. Factors are assembled that affect this design choice, and the SIMLAD model is compared with the bare SIMD and the MIMD models.
Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1994-01-01

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Design of a massively parallel computer using bit serial processing elements

NASA Technical Reports Server (NTRS)

Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing

1995-01-01

A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.
Architectures for single-chip image computing

NASA Astrophysics Data System (ADS)

Gove, Robert J.

1992-04-01

This paper will focus on the architectures of VLSI programmable processing components for image computing applications. TI, the maker of industry-leading RISC, DSP, and graphics components, has developed an architecture for a new-generation of image processors capable of implementing a plurality of image, graphics, video, and audio computing functions. We will show that the use of a single-chip heterogeneous MIMD parallel architecture best suits this class of processors--those which will dominate the desktop multimedia, document imaging, computer graphics, and visualization systems of this decade.
FPGA-Based, Self-Checking, Fault-Tolerant Computers

NASA Technical Reports Server (NTRS)

Some, Raphael; Rennels, David

2004-01-01

A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
Hyperswitch Communication Network Computer

NASA Technical Reports Server (NTRS)

Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.

1993-01-01

Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers, with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among following competing factors; performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.
Hypercluster - Parallel processing for computational mechanics

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1988-01-01

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver

NASA Technical Reports Server (NTRS)

Ajmani, Kumud; Taylor, Arthur C., III

1994-01-01

This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
Class network routing

DOEpatents

Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-09-08

Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Developing software to use parallel processing effectively. Final report, June-December 1987

DOE Office of Scientific and Technical Information (OSTI.GOV)

Center, J.

1988-10-01

This report describes the difficulties involved in writing efficient parallel programs and describes the hardware and software support currently available for generating software that utilizes processing effectively. Historically, the processing rate of single-processor computers has increased by one order of magnitude every five years. However, this pace is slowing since electronic circuitry is coming up against physical barriers. Unfortunately, the complexity of engineering and research problems continues to require ever more processing power (far in excess of the maximum estimated 3 Gflops achievable by single-processor computers). For this reason, parallel-processing architectures are receiving considerable interest, since they offer high performancemore » more cheaply than a single-processor supercomputer, such as the Cray.« less
Modeling Large Scale Circuits Using Massively Parallel Descrete-Event Simulation

DTIC Science & Technology

2013-06-01

exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power consumption...grow to exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power...Warp Speed 10.0. 2.0 INTRODUCTION As supercomputer systems approach exascale , the core count will exceed 1024 and number of transistors used in
WATERLOPP V2/64: A highly parallel machine for numerical computation

NASA Astrophysics Data System (ADS)

Ostlund, Neil S.

1985-07-01

Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor comunication, global synchronization, and contention for shared resources if the performance of a large number ( n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attemps at a solution to the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
Hypercluster Parallel Processor

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela

1992-01-01

Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.

Scan line graphics generation on the massively parallel processor

NASA Technical Reports Server (NTRS)

Dorband, John E.

1988-01-01

Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.
Flood replenishment: a new method of processor control.

PubMed

Frank, E D; Gray, J E; Wilken, D A

1980-01-01

In mechanized radiographic film processors that process medium to low volumes of film, roll films, and those that process single-emulsion films from nuclear medicine scans, computed tomography, and ultrasound, it is difficult to maintain the developer solution at a stable processing level. We describe our experience using flood replenishment, which is a method in which developer replenisher containing starter solution is introduced in the processor at timed intervals, independent of the number of films being processed. By this process, a stable level of developer activity is maintained in a processor used to develop a medium to low volume of single-emulsion film.
Backend Control Processor for a Multi-Processor Relational Database Computer System.

DTIC Science & Technology

1984-12-01

SCHOOL OF ENGI. UNCRSIFID MPONTIFF DEC 84 AFXT/GCS/ENG/84D-22 F/O 9/2 L ommhhhhmhhml mhhhommhhhhhm i-2 8 -- U0. 11111= Q. 2 111.8IIII- 1111111..6...THESIS Presented to the Faculty of the School of Engineering of the Air Force Institute of Technology Air University In Partial Fulfillment of the...development of a Backend Multi-Processor Relational Database Computer System. This thesis addresses a single component of this system, the Backend Control
Recovery Act - CAREER: Sustainable Silicon -- Energy-Efficient VLSI Interconnect for Extreme-Scale Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chiang, Patrick

2014-01-31

The research goal of this CAREER proposal is to develop energy-efficient, VLSI interconnect circuits and systems that will facilitate future massively-parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thou sands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.
Multicore: Fallout from a Computing Evolution

ScienceCinema

Yelick, Kathy [Director, NERSC

2017-12-09

July 22, 2008 Berkeley Lab lecture: Parallel computing used to be reserved for big science and engineering projects, but in two years that's all changed. Even laptops and hand-helds use parallel processors. Unfortunately, the software hasn't kept pace. Kathy Yelick, Director of the National Energy Research Scientific Computing Center at Berkeley Lab, describes the resulting chaos and the computing community's efforts to develop exciting applications that take advantage of tens or hundreds of processors on a single chip.
Multiprocessing on supercomputers for computational aerodynamics

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Mehta, Unmeel B.

1991-01-01

Little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPS or more) to improve turnaround time in computational aerodynamics. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, such improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) is applied through multitasking via a strategy that requires relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-Fortran-Unix interface. The existing code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor.
Performance of the Cell processor for biomolecular simulations

NASA Astrophysics Data System (ADS)

De Fabritiis, G.

2007-06-01

The new Cell processor represents a turning point for computing intensive applications. Here, I show that for molecular dynamics it is possible to reach an impressive sustained performance in excess of 30 Gflops with a peak of 45 Gflops for the non-bonded force calculations, over one order of magnitude faster than a single core standard processor.
Digital system for structural dynamics simulation

NASA Technical Reports Server (NTRS)

Krauter, A. I.; Lagace, L. J.; Wojnar, M. K.; Glor, C.

1982-01-01

State-of-the-art digital hardware and software for the simulation of complex structural dynamic interactions, such as those which occur in rotating structures (engine systems). System were incorporated in a designed to use an array of processors in which the computation for each physical subelement or functional subsystem would be assigned to a single specific processor in the simulator. These node processors are microprogrammed bit-slice microcomputers which function autonomously and can communicate with each other and a central control minicomputer over parallel digital lines. Inter-processor nearest neighbor communications busses pass the constants which represent physical constraints and boundary conditions. The node processors are connected to the six nearest neighbor node processors to simulate the actual physical interface of real substructures. Computer generated finite element mesh and force models can be developed with the aid of the central control minicomputer. The control computer also oversees the animation of a graphics display system, disk-based mass storage along with the individual processing elements.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.

1987-02-15

For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. Themore » usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.« less
Multicore: Fallout From a Computing Evolution (LBNL Summer Lecture Series)

ScienceCinema

Yelick, Kathy [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)

2018-05-07

Summer Lecture Series 2008: Parallel computing used to be reserved for big science and engineering projects, but in two years that's all changed. Even laptops and hand-helds use parallel processors. Unfortunately, the software hasn't kept pace. Kathy Yelick, Director of the National Energy Research Scientific Computing Center at Berkeley Lab, describes the resulting chaos and the computing community's efforts to develop exciting applications that take advantage of tens or hundreds of processors on a single chip.
Scheduling time-critical graphics on multiple processors

NASA Technical Reports Server (NTRS)

Meyer, Tom W.; Hughes, John F.

1995-01-01

This paper describes an algorithm for the scheduling of time-critical rendering and computation tasks on single- and multiple-processor architectures, with minimal pipelining. It was developed to manage scientific visualization scenes consisting of hundreds of objects, each of which can be computed and displayed at thousands of possible resolution levels. The algorithm generates the time-critical schedule using progressive-refinement techniques; it always returns a feasible schedule and, when allowed to run to completion, produces a near-optimal schedule which takes advantage of almost the entire multiple-processor system.
Formal design and verification of a reliable computing platform for real-time control. Phase 1: Results

NASA Technical Reports Server (NTRS)

Divito, Ben L.; Butler, Ricky W.; Caldwell, James L.

1990-01-01

A high-level design is presented for a reliable computing platform for real-time control applications. Design tradeoffs and analyses related to the development of the fault-tolerant computing platform are discussed. The architecture is formalized and shown to satisfy a key correctness property. The reliable computing platform uses replicated processors and majority voting to achieve fault tolerance. Under the assumption of a majority of processors working in each frame, it is shown that the replicated system computes the same results as a single processor system not subject to failures. Sufficient conditions are obtained to establish that the replicated system recovers from transient faults within a bounded amount of time. Three different voting schemes are examined and proved to satisfy the bounded recovery time conditions.
Multi-petascale highly efficient parallel supercomputer

DOEpatents

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

2015-07-14

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
A parallel algorithm for computing the eigenvalues of a symmetric tridiagonal matrix

NASA Technical Reports Server (NTRS)

Swarztrauber, Paul N.

1993-01-01

A parallel algorithm, called polysection, is presented for computing the eigenvalues of a symmetric tridiagonal matrix. The method is based on a quadratic recurrence in which the characteristic polynomial is constructed on a binary tree from polynomials whose degree doubles at each level. Intervals that contain exactly one zero are determined by the zeros of polynomials at the previous level which ensures that different processors compute different zeros. The signs of the polynomials at the interval endpoints are determined a priori and used to guarantee that all zeros are found. The use of finite-precision arithmetic may result in multiple zeros; however, in this case, the intervals coalesce and their number determines exactly the multiplicity of the zero. For an N x N matrix the eigenvalues can be determined in O(log-squared N) time with N-squared processors and O(N) time with N processors. The method is compared with a parallel variant of bisection that requires O(N-squared) time on a single processor, O(N) time with N processors, and O(log N) time with N-squared processors.
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine

NASA Astrophysics Data System (ADS)

Goto, Hiroyuki

This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single EBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Dynamic Load Balancing For Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Development of a small-scale computer cluster

NASA Astrophysics Data System (ADS)

Wilhelm, Jay; Smith, Justin T.; Smith, James E.

2008-04-01

An increase in demand for computing power in academia has necessitated the need for high performance machines. Computing power of a single processor has been steadily increasing, but lags behind the demand for fast simulations. Since a single processor has hard limits to its performance, a cluster of computers can have the ability to multiply the performance of a single computer with the proper software. Cluster computing has therefore become a much sought after technology. Typical desktop computers could be used for cluster computing, but are not intended for constant full speed operation and take up more space than rack mount servers. Specialty computers that are designed to be used in clusters meet high availability and space requirements, but can be costly. A market segment exists where custom built desktop computers can be arranged in a rack mount situation, gaining the space saving of traditional rack mount computers while remaining cost effective. To explore these possibilities, an experiment was performed to develop a computing cluster using desktop components for the purpose of decreasing computation time of advanced simulations. This study indicates that small-scale cluster can be built from off-the-shelf components which multiplies the performance of a single desktop machine, while minimizing occupied space and still remaining cost effective.
Vectorization with SIMD extensions speeds up reconstruction in electron tomography.

PubMed

Agulleiro, J I; Garzón, E M; García, I; Fernández, J J

2010-06-01

Electron tomography allows structural studies of cellular structures at molecular detail. Large 3D reconstructions are needed to meet the resolution requirements. The processing time to compute these large volumes may be considerable and so, high performance computing techniques have been used traditionally. This work presents a vector approach to tomographic reconstruction that relies on the exploitation of the SIMD extensions available in modern processors in combination to other single processor optimization techniques. This approach succeeds in producing full resolution tomograms with an important reduction in processing time, as evaluated with the most common reconstruction algorithms, namely WBP and SIRT. The main advantage stems from the fact that this approach is to be run on standard computers without the need of specialized hardware, which facilitates the development, use and management of programs. Future trends in processor design open excellent opportunities for vector processing with processor's SIMD extensions in the field of 3D electron microscopy.
Matrix-vector multiplication using digital partitioning for more accurate optical computing

NASA Technical Reports Server (NTRS)

Gary, C. K.

1992-01-01

Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and the speed required for an equivalent throughput as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient alogrithms if coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor.

Distributed processor allocation for launching applications in a massively connected processors complex

DOEpatents

Pedretti, Kevin

2008-11-18

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

NASA Astrophysics Data System (ADS)

Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

2013-03-01

With the development of international wireless communication standards, there is an increase in computational requirement for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to its high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architecture in SDR platform. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in a machine description language LISA. LISA is a language for instruction set architecture which can gain rapid model at architectural level. In order to evaluate the availability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved the maximum of 47.1% performance boost relative to the opponent processor.
Efficiently modeling neural networks on massively parallel computers

NASA Technical Reports Server (NTRS)

Farber, Robert M.

1993-01-01

Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.
An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors

NASA Technical Reports Server (NTRS)

Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

2015-01-01

This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.
Spaceborne Processor Array

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets

PubMed Central

Plaku, Erion; Kavraki, Lydia E.

2009-01-01

High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
Performance of a plasma fluid code on the Intel parallel computers

NASA Technical Reports Server (NTRS)

Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

1992-01-01

One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchtone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchtone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel (sigma) machine gives an improvement factor close to 64 over the single-processor CRAY-2.
FPGA-based distributed computing microarchitecture for complex physical dynamics investigation.

PubMed

Borgese, Gianluca; Pace, Calogero; Pantano, Pietro; Bilotta, Eleonora

2013-09-01

In this paper, we present a distributed computing system, called DCMARK, aimed at solving partial differential equations at the basis of many investigation fields, such as solid state physics, nuclear physics, and plasma physics. This distributed architecture is based on the cellular neural network paradigm, which allows us to divide the differential equation system solving into many parallel integration operations to be executed by a custom multiprocessor system. We push the number of processors to the limit of one processor for each equation. In order to test the present idea, we choose to implement DCMARK on a single FPGA, designing the single processor in order to minimize its hardware requirements and to obtain a large number of easily interconnected processors. This approach is particularly suited to study the properties of 1-, 2- and 3-D locally interconnected dynamical systems. In order to test the computing platform, we implement a 200 cells, Korteweg-de Vries (KdV) equation solver and perform a comparison between simulations conducted on a high performance PC and on our system. Since our distributed architecture takes a constant computing time to solve the equation system, independently of the number of dynamical elements (cells) of the CNN array, it allows us to reduce the elaboration time more than other similar systems in the literature. To ensure a high level of reconfigurability, we design a compact system on programmable chip managed by a softcore processor, which controls the fast data/control communication between our system and a PC Host. An intuitively graphical user interface allows us to change the calculation parameters and plot the results.
High Speed White Dwarf Asteroseismology with the Herty Hall Cluster

NASA Astrophysics Data System (ADS)

Gray, Aaron; Kim, A.

2012-01-01

Asteroseismology is the process of using observed oscillations of stars to infer their interior structure. In high speed asteroseismology, we complete that by quickly computing hundreds of thousands of models to match the observed period spectra. Each model on a single processor takes five to ten seconds to run. Therefore, we use a cluster of sixteen Dell Workstations with dual-core processors. The computers use the Ubuntu operating system and Apache Hadoop software to manage workloads.
Experimental Adiabatic Quantum Factorization under Ambient Conditions Based on a Solid-State Single Spin System.

PubMed

Xu, Kebiao; Xie, Tianyu; Li, Zhaokai; Xu, Xiangkun; Wang, Mengqi; Ye, Xiangyu; Kong, Fei; Geng, Jianpei; Duan, Changkui; Shi, Fazhan; Du, Jiangfeng

2017-03-31

The adiabatic quantum computation is a universal and robust method of quantum computing. In this architecture, the problem can be solved by adiabatically evolving the quantum processor from the ground state of a simple initial Hamiltonian to that of a final one, which encodes the solution of the problem. Adiabatic quantum computation has been proved to be a compatible candidate for scalable quantum computation. In this Letter, we report on the experimental realization of an adiabatic quantum algorithm on a single solid spin system under ambient conditions. All elements of adiabatic quantum computation, including initial state preparation, adiabatic evolution (simulated by optimal control), and final state read-out, are realized experimentally. As an example, we found the ground state of the problem Hamiltonian S_{z}I_{z} on our adiabatic quantum processor, which can be mapped to the factorization of 35 into its prime factors 5 and 7.
Experimental Adiabatic Quantum Factorization under Ambient Conditions Based on a Solid-State Single Spin System

NASA Astrophysics Data System (ADS)

Xu, Kebiao; Xie, Tianyu; Li, Zhaokai; Xu, Xiangkun; Wang, Mengqi; Ye, Xiangyu; Kong, Fei; Geng, Jianpei; Duan, Changkui; Shi, Fazhan; Du, Jiangfeng

2017-03-01

The adiabatic quantum computation is a universal and robust method of quantum computing. In this architecture, the problem can be solved by adiabatically evolving the quantum processor from the ground state of a simple initial Hamiltonian to that of a final one, which encodes the solution of the problem. Adiabatic quantum computation has been proved to be a compatible candidate for scalable quantum computation. In this Letter, we report on the experimental realization of an adiabatic quantum algorithm on a single solid spin system under ambient conditions. All elements of adiabatic quantum computation, including initial state preparation, adiabatic evolution (simulated by optimal control), and final state read-out, are realized experimentally. As an example, we found the ground state of the problem Hamiltonian SzIz on our adiabatic quantum processor, which can be mapped to the factorization of 35 into its prime factors 5 and 7.
Demonstration of Qubit Operations Below a Rigorous Fault Tolerance Threshold With Gate Set Tomography (Open Access, Publisher’s Version)

DTIC Science & Technology

2017-02-15

Maunz2 Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone...information processors have been demonstrated experimentally using superconducting circuits1–3, electrons in semiconductors4–6, trapped atoms and...qubit quantum information processor has been realized14, and single- qubit gates have demonstrated randomized benchmarking (RB) infidelities as low as 10
Spacecube V2.0 Micro Single Board Computer

NASA Technical Reports Server (NTRS)

Petrick, David J. (Inventor); Geist, Alessandro (Inventor); Lin, Michael R. (Inventor); Crum, Gary R. (Inventor)

2017-01-01

A single board computer system radiation hardened for space flight includes a printed circuit board having a top side and bottom side; a reconfigurable field programmable gate array (FPGA) processor device disposed on the top side; a connector disposed on the top side; a plurality of peripheral components mounted on the bottom side; and wherein a size of the single board computer system is not greater than approximately 7 cm.times.7 cm.
Solving the Cauchy-Riemann equations on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
Ray tracing on the MPP

NASA Technical Reports Server (NTRS)

Dorband, John E.

1987-01-01

Generating graphics to faithfully represent information can be a computationally intensive task. A way of using the Massively Parallel Processor to generate images by ray tracing is presented. This technique uses sort computation, a method of performing generalized routing interspersed with computation on a single-instruction-multiple-data (SIMD) computer.
Fault-Tolerant, Radiation-Hard DSP

NASA Technical Reports Server (NTRS)

Czajkowski, David

2011-01-01

Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-highspeed commercial DSPs with improved SEU tolerances. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are solved by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor to return to performance through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that the TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases significant speed reduction. The benefits of COTS high-performance signal processing include significant increase in onboard science data processing, enabling orders of magnitude reduction in required communication bandwidth for science data return, orders of magnitude improvement in onboard mission planning and critical decision making, and the ability to rapidly respond to changing mission environments, thus enabling opportunistic science and orders of magnitude reduction in the cost of mission operations through reduction of required staff. Additional benefits of COTS-based, high-performance signal processing include the ability to leverage considerable commercial and academic investments in advanced computing tools, techniques, and infra structure, and the familiarity of the science and IT community with these computing environments.
Performance of VPIC on Sequoia

NASA Astrophysics Data System (ADS)

Nystrom, William

2014-10-01

Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
Novel Robotic Tools for Piping Inspection and Repair

DTIC Science & Technology

2015-01-14

was selected due to its small size, and peripheral capability. The SoM measures 50mm x 44mm. The SoM processor is an ARM Cortex -A8 running at720MHz...designing an embedded computing system from scratch. The SoM is a single integrated module which contains the processor , RAM, power management, and
Design and Analysis of Scheduling Policies for Real-Time Computer Systems

DTIC Science & Technology

1992-01-01

C. M. Krishna, "The Impact of Workload on the Reliability of Real-Time Processor Triads," to appear in Micro . Rel. [17] J.F. Kurose, "Performance... Processor Triads", to appear in Micro . Rel. "* J.F. Kurose. "Performance Analysis of Minimum Laxity Scheduling in Discrete Time Queue- ing Systems", to...exponentially distributed service times and deadlines. A similar model was developed for the ED policy for a single processor system under identical
GASP-PL/I Simulation of Integrated Avionic System Processor Architectures. M.S. Thesis

NASA Technical Reports Server (NTRS)

Brent, G. A.

1978-01-01

A development study sponsored by NASA was completed in July 1977 which proposed a complete integration of all aircraft instrumentation into a single modular system. Instead of using the current single-function aircraft instruments, computers compiled and displayed inflight information for the pilot. A processor architecture called the Team Architecture was proposed. This is a hardware/software approach to high-reliability computer systems. A follow-up study of the proposed Team Architecture is reported. GASP-PL/1 simulation models are used to evaluate the operating characteristics of the Team Architecture. The problem, model development, simulation programs and results at length are presented. Also included are program input formats, outputs and listings.

Advanced computer architecture specification for automated weld systems

NASA Technical Reports Server (NTRS)

Katsinis, Constantine

1994-01-01

This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
Processors for wavelet analysis and synthesis: NIFS and TI-C80 MVP

NASA Astrophysics Data System (ADS)

Brooks, Geoffrey W.

1996-03-01

Two processors are considered for image quadrature mirror filtering (QMF). The neuromorphic infrared focal-plane sensor (NIFS) is an existing prototype analog processor offering high speed spatio-temporal Gaussian filtering, which could be used for the QMF low- pass function, and difference of Gaussian filtering, which could be used for the QMF high- pass function. Although not designed specifically for wavelet analysis, the biologically- inspired system accomplishes the most computationally intensive part of QMF processing. The Texas Instruments (TI) TMS320C80 Multimedia Video Processor (MVP) is a 32-bit RISC master processor with four advanced digital signal processors (DSPs) on a single chip. Algorithm partitioning, memory management and other issues are considered for optimal performance. This paper presents these considerations with simulated results leading to processor implementation of high-speed QMF analysis and synthesis.
ALEGRA -- A massively parallel h-adaptive code for solid dynamics

DOE Office of Scientific and Technical Information (OSTI.GOV)

Summers, R.M.; Wong, M.K.; Boucheron, E.A.

1997-12-31

ALEGRA is a multi-material, arbitrary-Lagrangian-Eulerian (ALE) code for solid dynamics designed to run on massively parallel (MP) computers. It combines the features of modern Eulerian shock codes, such as CTH, with modern Lagrangian structural analysis codes using an unstructured grid. ALEGRA is being developed for use on the teraflop supercomputers to conduct advanced three-dimensional (3D) simulations of shock phenomena important to a variety of systems. ALEGRA was designed with the Single Program Multiple Data (SPMD) paradigm, in which the mesh is decomposed into sub-meshes so that each processor gets a single sub-mesh with approximately the same number of elements. Usingmore » this approach the authors have been able to produce a single code that can scale from one processor to thousands of processors. A current major effort is to develop efficient, high precision simulation capabilities for ALEGRA, without the computational cost of using a global highly resolved mesh, through flexible, robust h-adaptivity of finite elements. H-adaptivity is the dynamic refinement of the mesh by subdividing elements, thus changing the characteristic element size and reducing numerical error. The authors are working on several major technical challenges that must be met to make effective use of HAMMER on MP computers.« less
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Donofrio, David

A method and apparatus for performing stencil computations efficiently are disclosed. In one embodiment, a processor receives an offset, and in response, retrieves a value from a memory via a single instruction, where the retrieving comprises: identifying, based on the offset, one of a plurality of registers of the processor; loading an address stored in the identified register; and retrieving from the memory the value at the address.
Formal design specification of a Processor Interface Unit

NASA Technical Reports Server (NTRS)

Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

1992-01-01

This report describes work to formally specify the requirements and design of a processor interface unit (PIU), a single-chip subsystem providing memory-interface bus-interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. The need for high-quality design assurance in such applications is an undisputed fact, given the disastrous consequences that even a single design flaw can produce. Thus, the further development and application of formal methods to fault-tolerant systems is of critical importance as these systems see increasing use in modern society.
Parallel computing using a Lagrangian formulation

NASA Technical Reports Server (NTRS)

Liou, May-Fun; Loh, Ching Yuen

1991-01-01

A new Lagrangian formulation of the Euler equation is adopted for the calculation of 2-D supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, a better than six times speed-up was achieved on a 8192-processor CM-2 over a single processor of a CRAY-2.
Parallel computing using a Lagrangian formulation

NASA Technical Reports Server (NTRS)

Liou, May-Fun; Loh, Ching-Yuen

1992-01-01

This paper adopts a new Lagrangian formulation of the Euler equation for the calculation of two dimensional supersonic steady flow. The Lagrangian formulation represents the inherent parallelism of the flow field better than the common Eulerian formulation and offers a competitive alternative on parallel computers. The implementation of the Lagrangian formulation on the Thinking Machines Corporation CM-2 Computer is described. The program uses a finite volume, first-order Godunov scheme and exhibits high accuracy in dealing with multidimensional discontinuities (slip-line and shock). By using this formulation, we have achieved better than six times speed-up on a 8192-processor CM-2 over a single processor of a CRAY-2.
Compute Server Performance Results

NASA Technical Reports Server (NTRS)

Stockdale, I. E.; Barton, John; Woodrow, Thomas (Technical Monitor)

1994-01-01

Parallel-vector supercomputers have been the workhorses of high performance computing. As expectations of future computing needs have risen faster than projected vector supercomputer performance, much work has been done investigating the feasibility of using Massively Parallel Processor systems as supercomputers. An even more recent development is the availability of high performance workstations which have the potential, when clustered together, to replace parallel-vector systems. We present a systematic comparison of floating point performance and price-performance for various compute server systems. A suite of highly vectorized programs was run on systems including traditional vector systems such as the Cray C90, and RISC workstations such as the IBM RS/6000 590 and the SGI R8000. The C90 system delivers 460 million floating point operations per second (FLOPS), the highest single processor rate of any vendor. However, if the price-performance ration (PPR) is considered to be most important, then the IBM and SGI processors are superior to the C90 processors. Even without code tuning, the IBM and SGI PPR's of 260 and 220 FLOPS per dollar exceed the C90 PPR of 160 FLOPS per dollar when running our highly vectorized suite,
Broadcasting collective operation contributions throughout a parallel computer

DOEpatents

Faraj, Ahmad [Rochester, MN

2012-02-21

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
Full-Authority Fault-Tolerant Electronic Engine Control System for Variable Cycle Engines.

DTIC Science & Technology

1982-04-01

single internally self-checked VLSI micro - processor . The selected configuration is an externally checked pair of com- mercially available...Electronic Engine Control FPMH Failures per Million Hours FTMP Fault Tolerant Multi- Processor FTSC Fault Tolerant Spaceborn Computer GRAMP Generalized...Removal * MTBR Mean Time Between Repair MTTF Mean Time to Failure xiii List of Abbreviations (continued) - NH High Pressure Rotor Speed O&S Operating
Montage Version 3.0

NASA Technical Reports Server (NTRS)

Jacob, Joseph; Katz, Daniel; Prince, Thomas; Berriman, Graham; Good, John; Laity, Anastasia

2006-01-01

The final version (3.0) of the Montage software has been released. To recapitulate from previous NASA Tech Briefs articles about Montage: This software generates custom, science-grade mosaics of astronomical images on demand from input files that comply with the Flexible Image Transport System (FITS) standard and contain image data registered on projections that comply with the World Coordinate System (WCS) standards. This software can be executed on single-processor computers, multi-processor computers, and such networks of geographically dispersed computers as the National Science Foundation s TeraGrid or NASA s Information Power Grid. The primary advantage of running Montage in a grid environment is that computations can be done on a remote supercomputer for efficiency. Multiple computers at different sites can be used for different parts of a computation a significant advantage in cases of computations for large mosaics that demand more processor time than is available at any one site. Version 3.0 incorporates several improvements over prior versions. The most significant improvement is that this version is accessible to scientists located anywhere, through operational Web services that provide access to data from several large astronomical surveys and construct mosaics on either local workstations or remote computational grids as needed.
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
Multitasking for flows about multiple body configurations using the chimera grid scheme

NASA Technical Reports Server (NTRS)

Dougherty, F. C.; Morgan, R. L.

1987-01-01

The multitasking of a finite-difference scheme using multiple overset meshes is described. In this chimera, or multiple overset mesh approach, a multiple body configuration is mapped using a major grid about the main component of the configuration, with minor overset meshes used to map each additional component. This type of code is well suited to multitasking. Both steady and unsteady two dimensional computations are run on parallel processors on a CRAY-X/MP 48, usually with one mesh per processor. Flow field results are compared with single processor results to demonstrate the feasibility of running multiple mesh codes on parallel processors and to show the increase in efficiency.
An efficient implementation of semi-numerical computation of the Hartree-Fock exchange on the Intel Phi processor

NASA Astrophysics Data System (ADS)

Liu, Fenglai; Kong, Jing

2018-07-01

Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
Programmable computing with a single magnetoresistive element

NASA Astrophysics Data System (ADS)

Ney, A.; Pampuch, C.; Koch, R.; Ploog, K. H.

2003-10-01

The development of transistor-based integrated circuits for modern computing is a story of great success. However, the proved concept for enhancing computational power by continuous miniaturization is approaching its fundamental limits. Alternative approaches consider logic elements that are reconfigurable at run-time to overcome the rigid architecture of the present hardware systems. Implementation of parallel algorithms on such `chameleon' processors has the potential to yield a dramatic increase of computational speed, competitive with that of supercomputers. Owing to their functional flexibility, `chameleon' processors can be readily optimized with respect to any computer application. In conventional microprocessors, information must be transferred to a memory to prevent it from getting lost, because electrically processed information is volatile. Therefore the computational performance can be improved if the logic gate is additionally capable of storing the output. Here we describe a simple hardware concept for a programmable logic element that is based on a single magnetic random access memory (MRAM) cell. It combines the inherent advantage of a non-volatile output with flexible functionality which can be selected at run-time to operate as an AND, OR, NAND or NOR gate.
Autonomic Cluster Management System (ACMS): A Demonstration of Autonomic Principles at Work

NASA Technical Reports Server (NTRS)

Baldassari, James D.; Kopec, Christopher L.; Leshay, Eric S.; Truszkowski, Walt; Finkel, David

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of achieving significant computational capabilities for high-performance computing applications, while simultaneously affording the ability to. increase that capability simply by adding more (inexpensive) processors. However, the task of manually managing and con.guring a cluster quickly becomes impossible as the cluster grows in size. Autonomic computing is a relatively new approach to managing complex systems that can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Automatic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management.
Methods and Apparatus for Aggregation of Multiple Pulse Code Modulation Channels into a Signal Time Division Multiplexing Stream

NASA Technical Reports Server (NTRS)

Chang, Chen J. (Inventor); Liaghati, Jr., Amir L. (Inventor); Liaghati, Mahsa L. (Inventor)

2018-01-01

Methods and apparatus are provided for telemetry processing using a telemetry processor. The telemetry processor can include a plurality of communications interfaces, a computer processor, and data storage. The telemetry processor can buffer sensor data by: receiving a frame of sensor data using a first communications interface and clock data using a second communications interface, receiving an end of frame signal using a third communications interface, and storing the received frame of sensor data in the data storage. After buffering the sensor data, the telemetry processor can generate an encapsulated data packet including a single encapsulated data packet header, the buffered sensor data, and identifiers identifying telemetry devices that provided the sensor data. A format of the encapsulated data packet can comply with a Consultative Committee for Space Data Systems (CCSDS) standard. The telemetry processor can send the encapsulated data packet using a fourth and a fifth communications interfaces.
High performance, low cost, self-contained, multipurpose PC based ground systems

NASA Technical Reports Server (NTRS)

Forman, Michael; Nickum, William; Troendly, Gregory

1993-01-01

The use of embedded processors greatly enhances the capabilities of personal computers when used for telemetry processing and command control center functions. Parallel architectures based on the use of transputers are shown to be very versatile and reusable, and the synergism between the PC and the embedded processor with transputers results in single unit, low cost workstations of 20 less than MIPS less than or equal to 1000.
SOP: parallel surrogate global optimization with Pareto center selection for computationally expensive single objective problems

DOE PAGES

Krityakierne, Tipaluck; Akhtar, Taimoor; Shoemaker, Christine A.

2016-02-02

This paper presents a parallel surrogate-based global optimization method for computationally expensive objective functions that is more effective for larger numbers of processors. To reach this goal, we integrated concepts from multi-objective optimization and tabu search into, single objective, surrogate optimization. Our proposed derivative-free algorithm, called SOP, uses non-dominated sorting of points for which the expensive function has been previously evaluated. The two objectives are the expensive function value of the point and the minimum distance of the point to previously evaluated points. Based on the results of non-dominated sorting, P points from the sorted fronts are selected as centersmore » from which many candidate points are generated by random perturbations. Based on surrogate approximation, the best candidate point is subsequently selected for expensive evaluation for each of the P centers, with simultaneous computation on P processors. Centers that previously did not generate good solutions are tabu with a given tenure. We show almost sure convergence of this algorithm under some conditions. The performance of SOP is compared with two RBF based methods. The test results show that SOP is an efficient method that can reduce time required to find a good near optimal solution. In a number of cases the efficiency of SOP is so good that SOP with 8 processors found an accurate answer in less wall-clock time than the other algorithms did with 32 processors.« less

Optimizing the inner loop of the gravitational force interaction on modern processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Warren, Michael S

2010-12-08

We have achieved superior performance on multiple generations of the fastest supercomputers in the world with our hashed oct-tree N-body code (HOT), spanning almost two decades and garnering multiple Gordon Bell Prizes for significant achievement in parallel processing. Execution time for our N-body code is largely influenced by the force calculation in the inner loop. Improvements to the inner loop using SSE3 instructions has enabled the calculation of over 200 million gravitational interactions per second per processor on a 2.6 GHz Opteron, for a computational rate of over 7 Gflops in single precision (700/0 of peak). We obtain optimal performancemore » some processors (including the Cell) by decomposing the reciprocal square root function required for a gravitational interaction into a table lookup, Chebychev polynomial interpolation, and Newton-Raphson iteration, using the algorithm of Karp. By unrolling the loop by a factor of six, and using SPU intrinsics to compute on vectors, we obtain performance of over 16 Gflops on a single Cell SPE. Aggregated over the 8 SPEs on a Cell processor, the overall performance is roughly 130 Gflops. In comparison, the ordinary C version of our inner loop only obtains 1.6 Gflops per SPE with the spuxlc compiler.« less
A sweep algorithm for massively parallel simulation of circuit-switched networks

NASA Technical Reports Server (NTRS)

Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.

1992-01-01

A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
On the relationship between parallel computation and graph embedding

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gupta, A.K.

1989-01-01

The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less
Technology transfer of military space microprocessor developments

NASA Astrophysics Data System (ADS)

Gorden, C.; King, D.; Byington, L.; Lanza, D.

1999-01-01

Over the past 13 years the Air Force Research Laboratory (AFRL) has led the development of microprocessors and computers for USAF space and strategic missile applications. As a result of these Air Force development programs, advanced computer technology is available for use by civil and commercial space customers as well. The Generic VHSIC Spaceborne Computer (GVSC) program began in 1985 at AFRL to fulfill a deficiency in the availability of space-qualified data and control processors. GVSC developed a radiation hardened multi-chip version of the 16-bit, Mil-Std 1750A microprocessor. The follow-on to GVSC, the Advanced Spaceborne Computer Module (ASCM) program, was initiated by AFRL to establish two industrial sources for complete, radiation-hardened 16-bit and 32-bit computers and microelectronic components. Development of the Control Processor Module (CPM), the first of two ASCM contract phases, concluded in 1994 with the availability of two sources for space-qualified, 16-bit Mil-Std-1750A computers, cards, multi-chip modules, and integrated circuits. The second phase of the program, the Advanced Technology Insertion Module (ATIM), was completed in December 1997. ATIM developed two single board computers based on 32-bit reduced instruction set computer (RISC) processors. GVSC, CPM, and ATIM technologies are flying or baselined into the majority of today's DoD, NASA, and commercial satellite systems.
Geospace simulations using modern accelerator processor technology

NASA Astrophysics Data System (ADS)

Germaschewski, K.; Raeder, J.; Larson, D. J.

2009-12-01

OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational powerof modern accelerator based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLops on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.
Methods and systems for providing reconfigurable and recoverable computing resources

NASA Technical Reports Server (NTRS)

Stange, Kent (Inventor); Hess, Richard (Inventor); Kelley, Gerald B (Inventor); Rogers, Randy (Inventor)

2010-01-01

A method for optimizing the use of digital computing resources to achieve reliability and availability of the computing resources is disclosed. The method comprises providing one or more processors with a recovery mechanism, the one or more processors executing one or more applications. A determination is made whether the one or more processors needs to be reconfigured. A rapid recovery is employed to reconfigure the one or more processors when needed. A computing system that provides reconfigurable and recoverable computing resources is also disclosed. The system comprises one or more processors with a recovery mechanism, with the one or more processors configured to execute a first application, and an additional processor configured to execute a second application different than the first application. The additional processor is reconfigurable with rapid recovery such that the additional processor can execute the first application when one of the one more processors fails.
Review of NASA's (National Aeronautics and Space Administration) Numerical Aerodynamic Simulation Program

NASA Technical Reports Server (NTRS)

1984-01-01

NASA has planned a supercomputer for computational fluid dynamics research since the mid-1970's. With the approval of the Numerical Aerodynamic Simulation Program as a FY 1984 new start, Congress requested an assessment of the program's objectives, projected short- and long-term uses, program design, computer architecture, user needs, and handling of proprietary and classified information. Specifically requested was an examination of the merits of proceeding with multiple high speed processor (HSP) systems contrasted with a single high speed processor system. The panel found NASA's objectives and projected uses sound and the projected distribution of users as realistic as possible at this stage. The multiple-HSP, whereby new, more powerful state-of-the-art HSP's would be integrated into a flexible network, was judged to present major advantages over any single HSP system.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

PubMed Central

Manolakos, Elias S.

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

PubMed

Sharma, Anuj; Manolakos, Elias S

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
A site oriented supercomputer for theoretical physics: The Fermilab Advanced Computer Program Multi Array Processor System (ACMAPS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nash, T.; Atac, R.; Cook, A.

1989-03-06

The ACPMAPS multipocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. Themore » system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.« less
Special-purpose computing for dense stellar systems

NASA Astrophysics Data System (ADS)

Makino, Junichiro

2007-08-01

I'll describe the current status of the GRAPE-DR project. The GRAPE-DR is the next-generation hardware for N-body simulation. Unlike the previous GRAPE hardwares, it is programmable SIMD machine with a large number of simple processors integrated into a single chip. The GRAPE-DR chip consists of 512 simple processors and operates at the clock speed of 500 MHz, delivering the theoretical peak speed of 512/226 Gflops (single/double precision). As of August 2006, the first prototype board with the sample chip successfully passed the test we prepared. The full GRAPE-DR system will consist of 4096 chips, reaching the theoretical peak speed of 2 Pflops.
Massively parallel processor computer

NASA Technical Reports Server (NTRS)

Fung, L. W. (Inventor)

1983-01-01

An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.
Multiprocessor shared-memory information exchange

DOE Office of Scientific and Technical Information (OSTI.GOV)

Santoline, L.L.; Bowers, M.D.; Crew, A.W.

1989-02-01

In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

2008-01-01

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlationmore » processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.« less
Parallel implementation of an adaptive scheme for 3D unstructured grids on the SP2

NASA Technical Reports Server (NTRS)

Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

1996-01-01

Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.0X speedup on 64 processors when 10 percent of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all the mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh immediately after targeting edges for refinement but before the actual adaption takes place. With this change, the speedup improves dramatically to 43.6X.
Parallel Implementation of an Adaptive Scheme for 3D Unstructured Grids on the SP2

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak; Strawn, Roger C.

1996-01-01

Dynamic mesh adaption on unstructured grids is a powerful tool for computing unsteady flows that require local grid modifications to efficiently resolve solution features. For this work, we consider an edge-based adaption scheme that has shown good single-processor performance on the C90. We report on our experience parallelizing this code for the SP2. Results show a 47.OX speedup on 64 processors when 10% of the mesh is randomly refined. Performance deteriorates to 7.7X when the same number of edges are refined in a highly-localized region. This is because almost all mesh adaption is confined to a single processor. However, this problem can be remedied by repartitioning the mesh immediately after targeting edges for refinement but before the actual adaption takes place. With this change, the speedup improves dramatically to 43.6X.
Parallel solution of closely coupled systems

NASA Technical Reports Server (NTRS)

Utku, S.; Salama, M.

1986-01-01

The odd-even permutation and associated unitary transformations for reordering the matrix coefficient A are employed as means of breaking the strong seriality which is characteristic of closely coupled systems. The nested dissection technique is also reviewed, and the equivalence between reordering A and dissecting its network is established. The effect of transforming A with odd-even permutation on its topology and the topology of its Cholesky factors is discussed. This leads to the construction of directed graphs showing the computational steps required for factoring A, their precedence relationships and their sequential and concurrent assignment to the available processors. Expressions for the speed-up and efficiency of using N processors in parallel relative to the sequential use of a single processor are derived from the directed graph. Similar expressions are also derived when the number of available processors is fewer than required.
Towards the formal verification of the requirements and design of a processor interface unit

NASA Technical Reports Server (NTRS)

Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

1993-01-01

The formal verification of the design and partial requirements for a Processor Interface Unit (PIU) using the Higher Order Logic (HOL) theorem-proving system is described. The processor interface unit is a single-chip subsystem within a fault-tolerant embedded system under development within the Boeing Defense and Space Group. It provides the opportunity to investigate the specification and verification of a real-world subsystem within a commercially-developed fault-tolerant computer. An overview of the PIU verification effort is given. The actual HOL listing from the verification effort are documented in a companion NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings' including the general-purpose HOL theories and definitions that support the PIU verification as well as tactics used in the proofs.
A parallel Monte Carlo code for planar and SPECT imaging: implementation, verification and applications in (131)I SPECT.

PubMed

Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F

2002-02-01

This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
A Programming Framework for Scientific Applications on CPU-GPU Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Owens, John

2013-03-24

At a high level, my research interests center around designing, programming, and evaluating computer systems that use new approaches to solve interesting problems. The rapid change of technology allows a variety of different architectural approaches to computationally difficult problems, and a constantly shifting set of constraints and trends makes the solutions to these problems both challenging and interesting. One of the most important recent trends in computing has been a move to commodity parallel architectures. This sea change is motivated by the industry’s inability to continue to profitably increase performance on a single processor and instead to move to multiplemore » parallel processors. In the period of review, my most significant work has been leading a research group looking at the use of the graphics processing unit (GPU) as a general-purpose processor. GPUs can potentially deliver superior performance on a broad range of problems than their CPU counterparts, but effectively mapping complex applications to a parallel programming model with an emerging programming environment is a significant and important research problem.« less

Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

PubMed

Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

2018-05-03

Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1996-01-01

We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1997-01-01

We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Interconnect-free parallel logic circuits in a single mechanical resonator

PubMed Central

Mahboob, I.; Flurin, E.; Nishiguchi, K.; Fujiwara, A.; Yamaguchi, H.

2011-01-01

In conventional computers, wiring between transistors is required to enable the execution of Boolean logic functions. This has resulted in processors in which billions of transistors are physically interconnected, which limits integration densities, gives rise to huge power consumption and restricts processing speeds. A method to eliminate wiring amongst transistors by condensing Boolean logic into a single active element is thus highly desirable. Here, we demonstrate a novel logic architecture using only a single electromechanical parametric resonator into which multiple channels of binary information are encoded as mechanical oscillations at different frequencies. The parametric resonator can mix these channels, resulting in new mechanical oscillation states that enable the construction of AND, OR and XOR logic gates as well as multibit logic circuits. Moreover, the mechanical logic gates and circuits can be executed simultaneously, giving rise to the prospect of a parallel logic processor in just a single mechanical resonator. PMID:21326230
Interconnect-free parallel logic circuits in a single mechanical resonator.

PubMed

Mahboob, I; Flurin, E; Nishiguchi, K; Fujiwara, A; Yamaguchi, H

2011-02-15

In conventional computers, wiring between transistors is required to enable the execution of Boolean logic functions. This has resulted in processors in which billions of transistors are physically interconnected, which limits integration densities, gives rise to huge power consumption and restricts processing speeds. A method to eliminate wiring amongst transistors by condensing Boolean logic into a single active element is thus highly desirable. Here, we demonstrate a novel logic architecture using only a single electromechanical parametric resonator into which multiple channels of binary information are encoded as mechanical oscillations at different frequencies. The parametric resonator can mix these channels, resulting in new mechanical oscillation states that enable the construction of AND, OR and XOR logic gates as well as multibit logic circuits. Moreover, the mechanical logic gates and circuits can be executed simultaneously, giving rise to the prospect of a parallel logic processor in just a single mechanical resonator.
An implementation of a tree code on a SIMD, parallel computer

NASA Technical Reports Server (NTRS)

Olson, Kevin M.; Dorband, John E.

1994-01-01

We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally, interacting, disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 make them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
GPUs, a New Tool of Acceleration in CFD: Efficiency and Reliability on Smoothed Particle Hydrodynamics Methods

PubMed Central

Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.

2011-01-01

Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185
Fast neural net simulation with a DSP processor array.

PubMed

Muller, U A; Gunzinger, A; Guggenbuhl, W

1995-01-01

This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-b floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system with 3.8 Gflops peak performance consumes less than 800 W of electrical power and fits into a 19-in rack. While reaching the speed of modern supercomputers, MUSIC still can be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a computing performance to a single user which was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications.
WMAP C&DH Software

NASA Technical Reports Server (NTRS)

Cudmore, Alan; Leath, Tim; Ferrer, Art; Miller, Todd; Walters, Mark; Savadkin, Bruce; Wu, Ji-Wei; Slegel, Steve; Stagmer, Emory

2007-01-01

The command-and-data-handling (C&DH) software of the Wilkinson Microwave Anisotropy Probe (WMAP) spacecraft functions as the sole interface between (1) the spacecraft and its instrument subsystem and (2) ground operations equipment. This software includes a command-decoding and -distribution system, a telemetry/data-handling system, and a data-storage-and-playback system. This software performs onboard processing of attitude sensor data and generates commands for attitude-control actuators in a closed-loop fashion. It also processes stored commands and monitors health and safety functions for the spacecraft and its instrument subsystems. The basic functionality of this software is the same of that of the older C&DH software of the Rossi X-Ray Timing Explorer (RXTE) spacecraft, the main difference being the addition of the attitude-control functionality. Previously, the C&DH and attitude-control computations were performed by different processors because a single RXTE processor did not have enough processing power. The WMAP spacecraft includes a more-powerful processor capable of performing both computations.
Hot Chips and Hot Interconnects for High End Computing Systems

NASA Technical Reports Server (NTRS)

Saini, Subhash

2005-01-01

I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in an IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid (MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing, (quantum computing, DNA computing, cellular engineering, and neural networks).
Multitasking OS manages a team of processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ripps, D.L.

1983-07-21

MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It aproaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is they share a backplane and parts of global memory.
Launching applications on compute and service processors running under different operating systems in scalable network of processor boards with routers

DOEpatents

Tomkins, James L [Albuquerque, NM; Camp, William J [Albuquerque, NM

2009-03-17

A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure also permits easy physical scalability of the computing apparatus. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.
New computing systems and their impact on structural analysis and design

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1989-01-01

A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
An evaluation of MPI message rate on hybrid-core processors

DOE PAGES

Barrett, Brian W.; Brightwell, Ron; Grant, Ryan; ...

2014-11-01

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insightmore » into how MPI implementations for future hybrid-core processors should be designed.« less
Advanced On-Board Processor (AOP). [for future spacecraft applications

NASA Technical Reports Server (NTRS)

1973-01-01

Advanced On-board Processor the (AOP) uses large scale integration throughout and is the most advanced space qualified computer of its class in existence today. It was designed to satisfy most spacecraft requirements which are anticipated over the next several years. The AOP design utilizes custom metallized multigate arrays (CMMA) which have been designed specifically for this computer. This approach provides the most efficient use of circuits, reduces volume, weight, assembly costs and provides for a significant increase in reliability by the significant reduction in conventional circuit interconnections. The required 69 CMMA packages are assembled on a single multilayer printed circuit board which together with associated connectors constitutes the complete AOP. This approach also reduces conventional interconnections thus further reducing weight, volume and assembly costs.
Parallel processors and nonlinear structural dynamics algorithms and software

NASA Technical Reports Server (NTRS)

Belytschko, Ted

1990-01-01

Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.
Autonomous Telemetry Collection for Single-Processor Small Satellites

NASA Technical Reports Server (NTRS)

Speer, Dave

2003-01-01

For the Space Technology 5 mission, which is being developed under NASA's New Millennium Program, a single spacecraft processor will be required to do on-board real-time computations and operations associated with attitude control, up-link and down-link communications, science data processing, solid-state recorder management, power switching and battery charge management, experiment data collection, health and status data collection, etc. Much of the health and status information is in analog form, and each of the analog signals must be routed to the input of an analog-to-digital converter, converted to digital form, and then stored in memory. If the micro-operations of the analog data collection process are implemented in software, the processor may use up a lot of time either waiting for the analog signal to settle, waiting for the analog-to-digital conversion to complete, or servicing a large number of high frequency interrupts. In order to off-load a very busy processor, the collection and digitization of all analog spacecraft health and status data will be done autonomously by a field-programmable gate array that can configure the analog signal chain, control the analog-to-digital converter, and store the converted data in memory.
Parallel processor for real-time structural control

NASA Astrophysics Data System (ADS)

Tise, Bert L.

1993-07-01

A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-to-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection to host computer, parallelizing code generator, and look- up-tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating- point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An OpenWindows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.
Multimedia architectures: from desktop systems to portable appliances

NASA Astrophysics Data System (ADS)

Bhaskaran, Vasudev; Konstantinides, Konstantinos; Natarajan, Balas R.

1997-01-01

Future desktop and portable computing systems will have as their core an integrated multimedia system. Such a system will seamlessly combine digital video, digital audio, computer animation, text, and graphics. Furthermore, such a system will allow for mixed-media creation, dissemination, and interactive access in real time. Multimedia architectures that need to support these functions have traditionally required special display and processing units for the different media types. This approach tends to be expensive and is inefficient in its use of silicon. Furthermore, such media-specific processing units are unable to cope with the fluid nature of the multimedia market wherein the needs and standards are changing and system manufacturers may demand a single component media engine across a range of products. This constraint has led to a shift towards providing a single-component multimedia specific computing engine that can be integrated easily within desktop systems, tethered consumer appliances, or portable appliances. In this paper, we review some of the recent architectural efforts in developing integrated media systems. We primarily focus on two efforts, namely the evolution of multimedia-capable general purpose processors and a more recent effort in developing single component mixed media co-processors. Design considerations that could facilitate the migration of these technologies to a portable integrated media system also are presented.
ms2: A molecular simulation tool for thermodynamic properties

NASA Astrophysics Data System (ADS)

Deublein, Stephan; Eckl, Bernhard; Stoll, Jürgen; Lishchuk, Sergey V.; Guevara-Carrion, Gabriela; Glass, Colin W.; Merker, Thorsten; Bernreuther, Martin; Hasse, Hans; Vrabec, Jadran

2011-11-01

This work presents the molecular simulation program ms2 that is designed for the calculation of thermodynamic properties of bulk fluids in equilibrium consisting of small electro-neutral molecules. ms2 features the two main molecular simulation techniques, molecular dynamics (MD) and Monte-Carlo. It supports the calculation of vapor-liquid equilibria of pure fluids and multi-component mixtures described by rigid molecular models on the basis of the grand equilibrium method. Furthermore, it is capable of sampling various classical ensembles and yields numerous thermodynamic properties. To evaluate the chemical potential, Widom's test molecule method and gradual insertion are implemented. Transport properties are determined by equilibrium MD simulations following the Green-Kubo formalism. ms2 is designed to meet the requirements of academia and industry, particularly achieving short response times and straightforward handling. It is written in Fortran90 and optimized for a fast execution on a broad range of computer architectures, spanning from single processor PCs over PC-clusters and vector computers to high-end parallel machines. The standard Message Passing Interface (MPI) is used for parallelization and ms2 is therefore easily portable to different computing platforms. Feature tools facilitate the interaction with the code and the interpretation of input and output files. The accuracy and reliability of ms2 has been shown for a large variety of fluids in preceding work. Program summaryProgram title:ms2 Catalogue identifier: AEJF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEJF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Special Licence supplied by the authors No. of lines in distributed program, including test data, etc.: 82 794 No. of bytes in distributed program, including test data, etc.: 793 705 Distribution format: tar.gz Programming language: Fortran90 Computer: The simulation tool ms2 is usable on a wide variety of platforms, from single processor machines over PC-clusters and vector computers to vector-parallel architectures. (Tested with Fortran compilers: gfortran, Intel, PathScale, Portland Group and Sun Studio.) Operating system: Unix/Linux, Windows Has the code been vectorized or parallelized?: Yes. Message Passing Interface (MPI) protocol Scalability. Excellent scalability up to 16 processors for molecular dynamics and >512 processors for Monte-Carlo simulations. RAM:ms2 runs on single processors with 512 MB RAM. The memory demand rises with increasing number of processors used per node and increasing number of molecules. Classification: 7.7, 7.9, 12 External routines: Message Passing Interface (MPI) Nature of problem: Calculation of application oriented thermodynamic properties for rigid electro-neutral molecules: vapor-liquid equilibria, thermal and caloric data as well as transport properties of pure fluids and multi-component mixtures. Solution method: Molecular dynamics, Monte-Carlo, various classical ensembles, grand equilibrium method, Green-Kubo formalism. Restrictions: No. The system size is user-defined. Typical problems addressed by ms2 can be solved by simulating systems containing typically 2000 molecules or less. Unusual features: Feature tools are available for creating input files, analyzing simulation results and visualizing molecular trajectories. Additional comments: Sample makefiles for multiple operation platforms are provided. Documentation is provided with the installation package and is available at http://www.ms-2.de. Running time: The running time of ms2 depends on the problem set, the system size and the number of processes used in the simulation. Running four processes on a "Nehalem" processor, simulations calculating VLE data take between two and twelve hours, calculating transport properties between six and 24 hours.

Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

NASA Technical Reports Server (NTRS)

Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

2012-01-01

In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
High-performance computing for airborne applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Quinn, Heather M; Manuzzato, Andrea; Fairbanks, Tom

2010-06-28

Recently, there has been attempts to move common satellite tasks to unmanned aerial vehicles (UAVs). UAVs are significantly cheaper to buy than satellites and easier to deploy on an as-needed basis. The more benign radiation environment also allows for an aggressive adoption of state-of-the-art commercial computational devices, which increases the amount of data that can be collected. There are a number of commercial computing devices currently available that are well-suited to high-performance computing. These devices range from specialized computational devices, such as field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), to traditional computing platforms, such as microprocessors. Even thoughmore » the radiation environment is relatively benign, these devices could be susceptible to single-event effects. In this paper, we will present radiation data for high-performance computing devices in a accelerated neutron environment. These devices include a multi-core digital signal processor, two field-programmable gate arrays, and a microprocessor. From these results, we found that all of these devices are suitable for many airplane environments without reliability problems.« less
Optical Interconnections for VLSI Computational Systems Using Computer-Generated Holography.

NASA Astrophysics Data System (ADS)

Feldman, Michael Robert

Optical interconnects for VLSI computational systems using computer generated holograms are evaluated in theory and experiment. It is shown that by replacing particular electronic connections with free-space optical communication paths, connection of devices on a single chip or wafer and between chips or modules can be improved. Optical and electrical interconnects are compared in terms of power dissipation, communication bandwidth, and connection density. Conditions are determined for which optical interconnects are advantageous. Based on this analysis, it is shown that by applying computer generated holographic optical interconnects to wafer scale fine grain parallel processing systems, dramatic increases in system performance can be expected. Some new interconnection networks, designed to take full advantage of optical interconnect technology, have been developed. Experimental Computer Generated Holograms (CGH's) have been designed, fabricated and subsequently tested in prototype optical interconnected computational systems. Several new CGH encoding methods have been developed to provide efficient high performance CGH's. One CGH was used to decrease the access time of a 1 kilobit CMOS RAM chip. Another was produced to implement the inter-processor communication paths in a shared memory SIMD parallel processor array.
MIT Lincoln Laboratory Takes the Mystery Out of Supercomupting

DTIC Science & Technology

2017-01-18

analysis, designing sensors, and developing algorithms. In 2008, the Lincoln demonstrated the largest single problem ever run on a computer using ... computation . As we design and prototype these devices, the use of leading–edge engineering practices have become the de facto standard. This includes...MIT Lincoln Laboratory Takes the Mystery Out of Supercomputing By Dr. Jeremy Kepner 1 The introduction of multicore and manycore processors
The Impact of IBM Cell Technology on the Programming Paradigm in the Context of Computer Systems for Climate and Weather Models

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhou, Shujia; Duffy, Daniel; Clune, Thomas

The call for ever-increasing model resolutions and physical processes in climate and weather models demands a continual increase in computing power. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive to fulfill this requirement. However, the Cell's characteristics, 256KB local memory per SPE and the new low-level communication mechanism, make it very challenging to port an application. As a trial, we selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (half the total computational time), (2) has an extremely high computational intensity: the ratiomore » of computational load to main memory transfers, and (3) exhibits embarrassingly parallel column computations. In this paper, we converted the baseline code (single-precision Fortran) to C and ported it to an IBM BladeCenter QS20. For performance, we manually SIMDize four independent columns and include several unrolling optimizations. Our results show that when compared with the baseline implementation running on one core of Intel's Xeon Woodcrest, Dempsey, and Itanium2, the Cell is approximately 8.8x, 11.6x, and 12.8x faster, respectively. Our preliminary analysis shows that the Cell can also accelerate the dynamics component (~;;25percent total computational time). We believe these dramatic performance improvements make the Cell processor very competitive as an accelerator.« less
Distributed computation of graphics primitives on a transputer network

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

A method is developed for distributing the computation of graphics primitives on a parallel processing network. Off-the-shelf transputer boards are used to perform the graphics transformations and scan-conversion tasks that would normally be assigned to a single transputer based display processor. Each node in the network performs a single graphics primitive computation. Frequently requested tasks can be duplicated on several nodes. The results indicate that the current distribution of commands on the graphics network shows a performance degradation when compared to the graphics display board alone. A change to more computation per node for every communication (perform more complex tasks on each node) may cause the desired increase in throughput.
Parallel processor for real-time structural control

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tise, B.L.

1992-01-01

A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection tomore » host computer, parallelizing code generator, and look-up-tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An Open Windows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.« less
Finite element computation on nearest neighbor connected machines

NASA Technical Reports Server (NTRS)

Mcaulay, A. D.

1984-01-01

Research aimed at faster, more cost effective parallel machines and algorithms for improving designer productivity with finite element computations is discussed. A set of 8 boards, containing 4 nearest neighbor connected arrays of commercially available floating point chips and substantial memory, are inserted into a commercially available machine. One-tenth Mflop (64 bit operation) processors provide an 89% efficiency when solving the equations arising in a finite element problem for a single variable regular grid of size 40 by 40 by 40. This is approximately 15 to 20 times faster than a much more expensive machine such as a VAX 11/780 used in double precision. The efficiency falls off as faster or more processors are envisaged because communication times become dominant. A novel successive overrelaxation algorithm which uses cyclic reduction in order to permit data transfer and computation to overlap in time is proposed.
Accuracy of the lattice-Boltzmann method using the Cell processor

NASA Astrophysics Data System (ADS)

Harvey, M. J.; de Fabritiis, G.; Giupponi, G.

2008-11-01

Accelerator processors like the new Cell processor are extending the traditional platforms for scientific computation, allowing orders of magnitude more floating-point operations per second (flops) compared to standard central processing units. However, they currently lack double-precision support and support for some IEEE 754 capabilities. In this work, we develop a lattice-Boltzmann (LB) code to run on the Cell processor and test the accuracy of this lattice method on this platform. We run tests for different flow topologies, boundary conditions, and Reynolds numbers in the range Re=6 350 . In one case, simulation results show a reduced mass and momentum conservation compared to an equivalent double-precision LB implementation. All other cases demonstrate the utility of the Cell processor for fluid dynamics simulations. Benchmarks on two Cell-based platforms are performed, the Sony Playstation3 and the QS20/QS21 IBM blade, obtaining a speed-up factor of 7 and 21, respectively, compared to the original PC version of the code, and a conservative sustained performance of 28 gigaflops per single Cell processor. Our results suggest that choice of IEEE 754 rounding mode is possibly as important as double-precision support for this specific scientific application.
Towards the formal specification of the requirements and design of a processor interface unit

NASA Technical Reports Server (NTRS)

Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

1993-01-01

Work to formally specify the requirements and design of a Processor Interface Unit (PIU), a single-chip subsystem providing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system, is described. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance free operation, or both. The approaches that were developed for modeling the PIU requirements and for composition of the PIU subcomponents at high levels of abstraction are described. These approaches were used to specify and verify a nontrivial subset of the PIU behavior. The PIU specification in Higher Order Logic (HOL) is documented in a companion NASA contractor report entitled 'Towards the Formal Specification of the Requirements and Design of a Processor Interfacs Unit - HOL Listings.' The subsequent verification approach and HOL listings are documented in NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit' and NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings.'
Flight design system level C requirements. Solid rocket booster and external tank impact prediction processors. [space transportation system

NASA Technical Reports Server (NTRS)

Seale, R. H.

1979-01-01

The prediction of the SRB and ET impact areas requires six separate processors. The SRB impact prediction processor computes the impact areas and related trajectory data for each SRB element. Output from this processor is stored on a secure file accessible by the SRB impact plot processor which generates the required plots. Similarly the ET RTLS impact prediction processor and the ET RTLS impact plot processor generates the ET impact footprints for return-to-launch-site (RTLS) profiles. The ET nominal/AOA/ATO impact prediction processor and the ET nominal/AOA/ATO impact plot processor generate the ET impact footprints for non-RTLS profiles. The SRB and ET impact processors compute the size and shape of the impact footprints by tabular lookup in a stored footprint dispersion data base. The location of each footprint is determined by simulating a reference trajectory and computing the reference impact point location. To insure consistency among all flight design system (FDS) users, much input required by these processors will be obtained from the FDS master data base.
Experience with a Genetic Algorithm Implemented on a Multiprocessor Computer

NASA Technical Reports Server (NTRS)

Plassman, Gerald E.; Sobieszczanski-Sobieski, Jaroslaw

2000-01-01

Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation engaging up to 128 processors confirmed expectations of radical elapsed time reductions comparing to a conventional single processor implementation. It also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent.
Predicting Cost/Performance Trade-Offs for Whitney: A Commodity Computing Cluster

NASA Technical Reports Server (NTRS)

Becker, Jeffrey C.; Nitzberg, Bill; VanderWijngaart, Rob F.; Kutler, Paul (Technical Monitor)

1997-01-01

Recent advances in low-end processor and network technology have made it possible to build a "supercomputer" out of commodity components. We develop simple models of the NAS Parallel Benchmarks version 2 (NPB 2) to explore the cost/performance trade-offs involved in building a balanced parallel computer supporting a scientific workload. We develop closed form expressions detailing the number and size of messages sent by each benchmark. Coupling these with measured single processor performance, network latency, and network bandwidth, our models predict benchmark performance to within 30%. A comparison based on total system cost reveals that current commodity technology (200 MHz Pentium Pros with 100baseT Ethernet) is well balanced for the NPBs up to a total system cost of around $1,000,000.
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

NASA Astrophysics Data System (ADS)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube

NASA Technical Reports Server (NTRS)

Joslin, Ronald D.; Zubair, Mohammad

1993-01-01

The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine indicates less than ideal speedups. However with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise wall-normal and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however it is estimated that a subgrid-scale model which reduces the required number of grid points and becomes a large-eddy simulation (PSLES) would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Development of an Autonomous Navigation Technology Test Vehicle

DTIC Science & Technology

2004-08-01

as an independent thread on processors using the Linux operating system. The computer hardware selected for the nodes that host the MRS threads...communications system design. Linux was chosen as the operating system for all of the single board computers used on the Mule. Linux was specifically...used for system analysis and development. The simple realization of multi-thread processing and inter-process communications in Linux made it a
QCDOC: A 10-teraflops scale computer for lattice QCD

NASA Astrophysics Data System (ADS)

Chen, D.; Christ, N. H.; Cristian, C.; Dong, Z.; Gara, A.; Garg, K.; Joo, B.; Kim, C.; Levkova, L.; Liao, X.; Mawhinney, R. D.; Ohta, S.; Wettig, T.

2001-03-01

The architecture of a new class of computers, optimized for lattice QCD calculations, is described. An individual node is based on a single integrated circuit containing a PowerPC 32-bit integer processor with a 1 Gflops 64-bit IEEE floating point unit, 4 Mbyte of memory, 8 Gbit/sec nearest-neighbor communications and additional control and diagnostic circuitry. The machine's name, QCDOC, derives from "QCD On a Chip".
A High Performance VLSI Computer Architecture For Computer Graphics

NASA Astrophysics Data System (ADS)

Chin, Chi-Yuan; Lin, Wen-Tai

1988-10-01

A VLSI computer architecture, consisting of multiple processors, is presented in this paper to satisfy the modern computer graphics demands, e.g. high resolution, realistic animation, real-time display etc.. All processors share a global memory which are partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcasted to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigurated to satisfy specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e. object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection, can be easily handled. With the current high density interconnection (MI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.

Lambda network having 2.sup.m-1 nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, Jr., Leonard M.

1995-01-01

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance.
A two-qubit photonic quantum processor and its application to solving systems of linear equations

PubMed Central

Barz, Stefanie; Kassal, Ivan; Ringbauer, Martin; Lipp, Yannick Ole; Dakić, Borivoje; Aspuru-Guzik, Alán; Walther, Philip

2014-01-01

Large-scale quantum computers will require the ability to apply long sequences of entangling gates to many qubits. In a photonic architecture, where single-qubit gates can be performed easily and precisely, the application of consecutive two-qubit entangling gates has been a significant obstacle. Here, we demonstrate a two-qubit photonic quantum processor that implements two consecutive CNOT gates on the same pair of polarisation-encoded qubits. To demonstrate the flexibility of our system, we implement various instances of the quantum algorithm for solving of systems of linear equations. PMID:25135432
Slime mould processors, logic gates and sensors.

PubMed

Adamatzky, A

2015-07-28

A heterotic, or hybrid, computation implies that two or more substrates of different physical nature are merged into a single device with indistinguishable parts. These hybrid devices then undertake coherent acts on programmable and sensible processing of information. We study the potential of heterotic computers using slime mould acting under the guidance of chemical, mechanical and optical stimuli. Plasmodium of acellular slime mould Physarum polycephalum is a gigantic single cell visible to the unaided eye. The cell shows a rich spectrum of behavioural morphological patterns in response to changing environmental conditions. Given data represented by chemical or physical stimuli, we can employ and modify the behaviour of the slime mould to make it solve a range of computing and sensing tasks. We overview results of laboratory experimental studies on prototyping of the slime mould morphological processors for approximation of Voronoi diagrams, planar shapes and solving mazes, and discuss logic gates implemented via collision of active growing zones and tactile responses of P. polycephalum. We also overview a range of electronic components--memristor, chemical, tactile and colour sensors-made of the slime mould. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
AIC Computations Using Navier-Stokes Equations on Single Image Supercomputers For Design Optimization

NASA Technical Reports Server (NTRS)

Guruswamy, Guru

2004-01-01

A procedure to accurately generate AIC using the Navier-Stokes solver including grid deformation is presented. Preliminary results show good comparisons between experiment and computed flutter boundaries for a rectangular wing. A full wing body configuration of an orbital space plane is selected for demonstration on a large number of processors. In the final paper the AIC of full wing body configuration will be computed. The scalability of the procedure on supercomputer will be demonstrated.
Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

PubMed

Komarov, Ivan; D'Souza, Roshan M

2012-01-01

The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

NASA Astrophysics Data System (ADS)

Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

2014-03-01

List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for mutli-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
Many-core computing for space-based stereoscopic imaging

NASA Astrophysics Data System (ADS)

McCall, Paul; Torres, Gildo; LeGrand, Keith; Adjouadi, Malek; Liu, Chen; Darling, Jacob; Pernicka, Henry

The potential benefits of using parallel computing in real-time visual-based satellite proximity operations missions are investigated. Improvements in performance and relative navigation solutions over single thread systems can be achieved through multi- and many-core computing. Stochastic relative orbit determination methods benefit from the higher measurement frequencies, allowing them to more accurately determine the associated statistical properties of the relative orbital elements. More accurate orbit determination can lead to reduced fuel consumption and extended mission capabilities and duration. Inherent to the process of stereoscopic image processing is the difficulty of loading, managing, parsing, and evaluating large amounts of data efficiently, which may result in delays or highly time consuming processes for single (or few) processor systems or platforms. In this research we utilize the Single-Chip Cloud Computer (SCC), a fully programmable 48-core experimental processor, created by Intel Labs as a platform for many-core software research, provided with a high-speed on-chip network for sharing information along with advanced power management technologies and support for message-passing. The results from utilizing the SCC platform for the stereoscopic image processing application are presented in the form of Performance, Power, Energy, and Energy-Delay-Product (EDP) metrics. Also, a comparison between the SCC results and those obtained from executing the same application on a commercial PC are presented, showing the potential benefits of utilizing the SCC in particular, and any many-core platforms in general for real-time processing of visual-based satellite proximity operations missions.
A universal computer control system for motors

NASA Technical Reports Server (NTRS)

Szakaly, Zoltan F. (Inventor)

1991-01-01

A control system for a multi-motor system such as a space telerobot, having a remote computational node and a local computational node interconnected with one another by a high speed data link is described. A Universal Computer Control System (UCCS) for the telerobot is located at each node. Each node is provided with a multibus computer system which is characterized by a plurality of processors with all processors being connected to a common bus, and including at least one command processor. The command processor communicates over the bus with a plurality of joint controller cards. A plurality of direct current torque motors, of the type used in telerobot joints and telerobot hand-held controllers, are connected to the controller cards and responds to digital control signals from the command processor. Essential motor operating parameters are sensed by analog sensing circuits and the sensed analog signals are converted to digital signals for storage at the controller cards where such signals can be read during an address read/write cycle of the command processing processor.
Comparative Implementation of High Performance Computing for Power System Dynamic Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Huang, Zhenyu; Diao, Ruisheng

Dynamic simulation for transient stability assessment is one of the most important, but intensive, computations for power system planning and operation. Present commercial software is mainly designed for sequential computation to run a single simulation, which is very time consuming with a single processer. The application of High Performance Computing (HPC) to dynamic simulations is very promising in accelerating the computing process by parallelizing its kernel algorithms while maintaining the same level of computation accuracy. This paper describes the comparative implementation of four parallel dynamic simulation schemes in two state-of-the-art HPC environments: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP).more » These implementations serve to match the application with dedicated multi-processor computing hardware and maximize the utilization and benefits of HPC during the development process.« less
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm

NASA Technical Reports Server (NTRS)

Mavriplis, Dimitri J.

1999-01-01

The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

NASA Technical Reports Server (NTRS)

Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

1984-01-01

SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
A case study for the real-time experimental evaluation of the VIPER microprocessor

NASA Astrophysics Data System (ADS)

Carreno, Victor A.; Angellatta, Rob K.

1991-09-01

An experiment to evaluate the applicability of the Verifiable Integrated Processor for Enhanced Reliability (VIPER) microprocessor to real time control is described. The VIPER microprocessor was invented by the Royal Signals and Radar Establishment (RSRE), U.K., and is an example of the use of formal mathematical methods for developing electronic digital systems with a high degree of assurance on the system design and implementation correctness. The experiment consisted of selecting a control law, writing the control law algorithm for the VIPER processor, and providing real time, dynamic inputs into the processor and monitoring the outputs. The control law selected and coded for the VIPER processor was the yaw damper function of an automatic landing program for a 737 aircraft. The mechanisms for interfacing the VIPER Single Board Computer to the VAX host are described. Results include run time experiences, performance evaluation, and comparison of VIPER and FORTRAN yaw damper algorithm output for accuracy estimation.
A case study for the real-time experimental evaluation of the VIPER microprocessor

NASA Technical Reports Server (NTRS)

Carreno, Victor A.; Angellatta, Rob K.

1991-01-01

An experiment to evaluate the applicability of the Verifiable Integrated Processor for Enhanced Reliability (VIPER) microprocessor to real time control is described. The VIPER microprocessor was invented by the Royal Signals and Radar Establishment (RSRE), U.K., and is an example of the use of formal mathematical methods for developing electronic digital systems with a high degree of assurance on the system design and implementation correctness. The experiment consisted of selecting a control law, writing the control law algorithm for the VIPER processor, and providing real time, dynamic inputs into the processor and monitoring the outputs. The control law selected and coded for the VIPER processor was the yaw damper function of an automatic landing program for a 737 aircraft. The mechanisms for interfacing the VIPER Single Board Computer to the VAX host are described. Results include run time experiences, performance evaluation, and comparison of VIPER and FORTRAN yaw damper algorithm output for accuracy estimation.
Using all of your CPU's in HIPE

NASA Astrophysics Data System (ADS)

Jacobson, J. D.; Fadda, D.

2012-09-01

Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
Optimization of Car Body under Constraints of Noise, Vibration, and Harshness (NVH), and Crash

NASA Technical Reports Server (NTRS)

Kodiyalam, Srinivas; Yang, Ren-Jye; Sobieszczanski-Sobieski, Jaroslaw (Editor)

2000-01-01

To be competitive on the today's market, cars have to be as light as possible while meeting the Noise, Vibration, and Harshness (NVH) requirements and conforming to Government-man dated crash survival regulations. The latter are difficult to meet because they involve very compute-intensive, nonlinear analysis, e.g., the code RADIOSS capable of simulation of the dynamics, and the geometrical and material nonlinearities of a thin-walled car structure in crash, would require over 12 days of elapsed time for a single design of a 390K elastic degrees of freedom model, if executed on a single processor of the state-of-the-art SGI Origin2000 computer. Of course, in optimization that crash analysis would have to be invoked many times. Needless to say, that has rendered such optimization intractable until now. The car finite element model is shown. The advent of computers that comprise large numbers of concurrently operating processors has created a new environment wherein the above optimization, and other engineering problems heretofore regarded as intractable may be solved. The procedure, shown, is a piecewise approximation based method and involves using a sensitivity based Taylor series approximation model for NVH and a polynomial response surface model for Crash. In that method the NVH constraints are evaluated using a finite element code (MSC/NASTRAN) that yields the constraint values and their derivatives with respect to design variables. The crash constraints are evaluated using the explicit code RADIOSS on the Origin 2000 operating on 256 processors simultaneously to generate data for a polynomial response surface in the design variable domain. The NVH constraints and their derivatives combined with the response surface for the crash constraints form an approximation to the system analysis (surrogate analysis) that enables a cycle of multidisciplinary optimization within move limits. In the inner loop, the NVH sensitivities are recomputed to update the NVH approximation model while keeping the Crash response surface constant. In every outer loop, the Crash response surface approximation is updated, including a gradual increase in the order of the response surface and the response surface extension in the direction of the search. In this optimization task, the NVH discipline has 30 design variables while the crash discipline has 20 design variables. A subset of these design variables (10) are common to both the NVH and crash disciplines. In order to construct a linear response surface for the Crash discipline constraints, a minimum of 21 design points would have to be analyzed using the RADIOSS code. On a single processor in Origin 2000 that amount of computing would require over 9 months! In this work, these runs were carried out concurrently on the Origin 2000 using multiple processors, ranging from 8 to 16, for each crash (RADIOSS) analysis. Another figure shows the wall time required for a single RADIOSS analysis using varying number of processors, as well as provides a comparison of 2 different common data placement procedures within the allotted memories for each analysis. The initial design is an infeasible design with NVH discipline Static Torsion constraint violations of over 10%. The final optimized design is a feasible design with a weight reduction of 15 kg compared to the initial design. This work demonstrates how advanced methodology for optimization combined with the technology of concurrent processing enables applications that until now were out of reach because of very long time-to-solution.
A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Ediger, David; Jiang, Karl

2009-02-15

We present a new lock-free parallel algorithm for computing betweenness centralityof massive small-world networks. With minor changes to the data structures, ouralgorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 millionmore » vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less
A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

DOE Office of Scientific and Technical Information (OSTI.GOV)

Madduri, Kamesh; Ediger, David; Jiang, Karl

2009-05-29

We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor.more » For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less
LANDSAT-D flight segment operations manual. Appendix B: OBC software operations

NASA Technical Reports Server (NTRS)

Talipsky, R.

1981-01-01

The LANDSAT 4 satellite contains two NASA standard spacecraft computers and 65,536 words of memory. Onboard computer software is divided into flight executive and applications processors. Both applications processors and the flight executive use one or more of 67 system tables to obtain variables, constants, and software flags. Output from the software for monitoring operation is via 49 OBC telemetry reports subcommutated in the spacecraft telemetry. Information is provided about the flight software as it is used to control the various spacecraft operations and interpret operational OBC telemetry. Processor function descriptions, processor operation, software constraints, processor system tables, processor telemetry, and processor flow charts are presented.
Accelerated Adaptive MGS Phase Retrieval

NASA Technical Reports Server (NTRS)

Lam, Raymond K.; Ohara, Catherine M.; Green, Joseph J.; Bikkannavar, Siddarayappa A.; Basinger, Scott A.; Redding, David C.; Shi, Fang

2011-01-01

The Modified Gerchberg-Saxton (MGS) algorithm is an image-based wavefront-sensing method that can turn any science instrument focal plane into a wavefront sensor. MGS characterizes optical systems by estimating the wavefront errors in the exit pupil using only intensity images of a star or other point source of light. This innovative implementation of MGS significantly accelerates the MGS phase retrieval algorithm by using stream-processing hardware on conventional graphics cards. Stream processing is a relatively new, yet powerful, paradigm to allow parallel processing of certain applications that apply single instructions to multiple data (SIMD). These stream processors are designed specifically to support large-scale parallel computing on a single graphics chip. Computationally intensive algorithms, such as the Fast Fourier Transform (FFT), are particularly well suited for this computing environment. This high-speed version of MGS exploits commercially available hardware to accomplish the same objective in a fraction of the original time. The exploit involves performing matrix calculations in nVidia graphic cards. The graphical processor unit (GPU) is hardware that is specialized for computationally intensive, highly parallel computation. From the software perspective, a parallel programming model is used, called CUDA, to transparently scale multicore parallelism in hardware. This technology gives computationally intensive applications access to the processing power of the nVidia GPUs through a C/C++ programming interface. The AAMGS (Accelerated Adaptive MGS) software takes advantage of these advanced technologies, to accelerate the optical phase error characterization. With a single PC that contains four nVidia GTX-280 graphic cards, the new implementation can process four images simultaneously to produce a JWST (James Webb Space Telescope) wavefront measurement 60 times faster than the previous code.
LDPC decoder with a limited-precision FPGA-based floating-point multiplication coprocessor

NASA Astrophysics Data System (ADS)

Moberly, Raymond; O'Sullivan, Michael; Waheed, Khurram

2007-09-01

Implementing the sum-product algorithm, in an FPGA with an embedded processor, invites us to consider a tradeoff between computational precision and computational speed. The algorithm, known outside of the signal processing community as Pearl's belief propagation, is used for iterative soft-decision decoding of LDPC codes. We determined the feasibility of a coprocessor that will perform product computations. Our FPGA-based coprocessor (design) performs computer algebra with significantly less precision than the standard (e.g. integer, floating-point) operations of general purpose processors. Using synthesis, targeting a 3,168 LUT Xilinx FPGA, we show that key components of a decoder are feasible and that the full single-precision decoder could be constructed using a larger part. Soft-decision decoding by the iterative belief propagation algorithm is impacted both positively and negatively by a reduction in the precision of the computation. Reducing precision reduces the coding gain, but the limited-precision computation can operate faster. A proposed solution offers custom logic to perform computations with less precision, yet uses the floating-point format to interface with the software. Simulation results show the achievable coding gain. Synthesis results help theorize the the full capacity and performance of an FPGA-based coprocessor.

Documentary table-top view of a comparison of the General Purpose Computers.

NASA Image and Video Library

1988-09-13

S88-47513 (Aug 1988) --- The current and future versions of general purpose computers for Space Shuttle orbiters are represented in this frame. The two boxes on the left (AP101B) represent the current GPC configuration, with the input-output processor at far left and the central processing unit at its side. The upgraded version combines both elements in a single unit (far right, AP101S).
Design for a Manufacturing Method for Memristor-Based Neuromorphic Computing Processors

DTIC Science & Technology

2013-03-01

DESIGN FOR A MANUFACTURING METHOD FOR MEMRISTOR- BASED NEUROMORPHIC COMPUTING PROCESSORS UNIVERSITY OF PITTSBURGH MARCH 2013...BASED NEUROMORPHIC COMPUTING PROCESSORS 5a. CONTRACT NUMBER FA8750-11-1-0271 5b. GRANT NUMBER N/A 5c. PROGRAM ELEMENT NUMBER 62788F 6. AUTHOR(S...synapses and implemented a neuromorphic computing system based on our proposed synapse designs. The robustness of our system is also evaluated by
High-performance computing — an overview

NASA Astrophysics Data System (ADS)

Marksteiner, Peter

1996-08-01

An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Conditional load and store in a shared memory

DOEpatents

Blumrich, Matthias A; Ohmacht, Martin

2015-02-03

A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
NCC Simulation Model: Simulating the operations of the network control center, phase 2

NASA Technical Reports Server (NTRS)

Benjamin, Norman M.; Paul, Arthur S.; Gill, Tepper L.

1992-01-01

The simulation of the network control center (NCC) is in the second phase of development. This phase seeks to further develop the work performed in phase one. Phase one concentrated on the computer systems and interconnecting network. The focus of phase two will be the implementation of the network message dialogues and the resources controlled by the NCC. These resources are requested, initiated, monitored and analyzed via network messages. In the NCC network messages are presented in the form of packets that are routed across the network. These packets are generated, encoded, decoded and processed by the network host processors that generate and service the message traffic on the network that connects these hosts. As a result, the message traffic is used to characterize the work done by the NCC and the connected network. Phase one of the model development represented the NCC as a network of bi-directional single server queues and message generating sources. The generators represented the external segment processors. The served based queues represented the host processors. The NCC model consists of the internal and external processors which generate message traffic on the network that links these hosts. To fully realize the objective of phase two it is necessary to identify and model the processes in each internal processor. These processes live in the operating system of the internal host computers and handle tasks such as high speed message exchanging, ISN and NFE interface, event monitoring, network monitoring, and message logging. Inter process communication is achieved through the operating system facilities. The overall performance of the host is determined by its ability to service messages generated by both internal and external processors.
Reactor Dosimetry Applications Using RAPTOR-M3G:. a New Parallel 3-D Radiation Transport Code

NASA Astrophysics Data System (ADS)

Longoni, Gianluca; Anderson, Stanwood L.

2009-08-01

The numerical solution of the Linearized Boltzmann Equation (LBE) via the Discrete Ordinates method (SN) requires extensive computational resources for large 3-D neutron and gamma transport applications due to the concurrent discretization of the angular, spatial, and energy domains. This paper will discuss the development RAPTOR-M3G (RApid Parallel Transport Of Radiation - Multiple 3D Geometries), a new 3-D parallel radiation transport code, and its application to the calculation of ex-vessel neutron dosimetry responses in the cavity of a commercial 2-loop Pressurized Water Reactor (PWR). RAPTOR-M3G is based domain decomposition algorithms, where the spatial and angular domains are allocated and processed on multi-processor computer architectures. As compared to traditional single-processor applications, this approach reduces the computational load as well as the memory requirement per processor, yielding an efficient solution methodology for large 3-D problems. Measured neutron dosimetry responses in the reactor cavity air gap will be compared to the RAPTOR-M3G predictions. This paper is organized as follows: Section 1 discusses the RAPTOR-M3G methodology; Section 2 describes the 2-loop PWR model and the numerical results obtained. Section 3 addresses the parallel performance of the code, and Section 4 concludes this paper with final remarks and future work.
Performance prediction: A case study using a multi-ring KSR-1 machine

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Zhu, Jianping

1995-01-01

While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, heirarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw

The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPUmore » to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.« less
System-wide power management control via clock distribution network

DOEpatents

Coteus, Paul W.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Reed, Don D.

2015-05-19

An apparatus, method and computer program product for automatically controlling power dissipation of a parallel computing system that includes a plurality of processors. A computing device issues a command to the parallel computing system. A clock pulse-width modulator encodes the command in a system clock signal to be distributed to the plurality of processors. The plurality of processors in the parallel computing system receive the system clock signal including the encoded command, and adjusts power dissipation according to the encoded command.
Low latency messages on distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Saltz, Joel

1993-01-01

Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
A pluggable framework for parallel pairwise sequence search.

PubMed

Archuleta, Jeremy; Feng, Wu-chun; Tilevich, Eli

2007-01-01

The current and near future of the computing industry is one of multi-core and multi-processor technology. Most existing sequence-search tools have been designed with a focus on single-core, single-processor systems. This discrepancy between software design and hardware architecture substantially hinders sequence-search performance by not allowing full utilization of the hardware. This paper presents a novel framework that will aid the conversion of serial sequence-search tools into a parallel version that can take full advantage of the available hardware. The framework, which is based on a software architecture called mixin layers with refined roles, enables modules to be plugged into the framework with minimal effort. The inherent modular design improves maintenance and extensibility, thus opening up a plethora of opportunities for advanced algorithmic features to be developed and incorporated while routine maintenance of the codebase persists.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.

PubMed

Scharfe, Michael; Pielot, Rainer; Schreiber, Falk

2010-01-11

Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.
Accuracy requirements of optical linear algebra processors in adaptive optics imaging systems

NASA Technical Reports Server (NTRS)

Downie, John D.; Goodman, Joseph W.

1989-01-01

The accuracy requirements of optical processors in adaptive optics systems are determined by estimating the required accuracy in a general optical linear algebra processor (OLAP) that results in a smaller average residual aberration than that achieved with a conventional electronic digital processor with some specific computation speed. Special attention is given to an error analysis of a general OLAP with regard to the residual aberration that is created in an adaptive mirror system by the inaccuracies of the processor, and to the effect of computational speed of an electronic processor on the correction. Results are presented on the ability of an OLAP to compete with a digital processor in various situations.
Lambda network having 2{sup m{minus}1} nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, L.M. Jr.

1995-11-28

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance. 14 figs.
Dedicated hardware processor and corresponding system-on-chip design for real-time laser speckle imaging.

PubMed

Jiang, Chao; Zhang, Hongyan; Wang, Jia; Wang, Yaru; He, Heng; Liu, Rui; Zhou, Fangyuan; Deng, Jialiang; Li, Pengcheng; Luo, Qingming

2011-11-01

Laser speckle imaging (LSI) is a noninvasive and full-field optical imaging technique which produces two-dimensional blood flow maps of tissues from the raw laser speckle images captured by a CCD camera without scanning. We present a hardware-friendly algorithm for the real-time processing of laser speckle imaging. The algorithm is developed and optimized specifically for LSI processing in the field programmable gate array (FPGA). Based on this algorithm, we designed a dedicated hardware processor for real-time LSI in FPGA. The pipeline processing scheme and parallel computing architecture are introduced into the design of this LSI hardware processor. When the LSI hardware processor is implemented in the FPGA running at the maximum frequency of 130 MHz, up to 85 raw images with the resolution of 640×480 pixels can be processed per second. Meanwhile, we also present a system on chip (SOC) solution for LSI processing by integrating the CCD controller, memory controller, LSI hardware processor, and LCD display controller into a single FPGA chip. This SOC solution also can be used to produce an application specific integrated circuit for LSI processing.
Architecture design of the multi-functional wavelet-based ECG microprocessor for realtime detection of abnormal cardiac events.

PubMed

Cheng, Li-Fang; Chen, Tung-Chien; Chen, Liang-Gee

2012-01-01

Most of the abnormal cardiac events such as myocardial ischemia, acute myocardial infarction (AMI) and fatal arrhythmia can be diagnosed through continuous electrocardiogram (ECG) analysis. According to recent clinical research, early detection and alarming of such cardiac events can reduce the time delay to the hospital, and the clinical outcomes of these individuals can be greatly improved. Therefore, it would be helpful if there is a long-term ECG monitoring system with the ability to identify abnormal cardiac events and provide realtime warning for the users. The combination of the wireless body area sensor network (BASN) and the on-sensor ECG processor is a possible solution for this application. In this paper, we aim to design and implement a digital signal processor that is suitable for continuous ECG monitoring and alarming based on the continuous wavelet transform (CWT) through the proposed architectures--using both programmable RISC processor and application specific integrated circuits (ASIC) for performance optimization. According to the implementation results, the power consumption of the proposed processor integrated with an ASIC for CWT computation is only 79.4 mW. Compared with the single-RISC processor, about 91.6% of the power reduction is achieved.
Programs for Testing Processor-in-Memory Computing Systems

NASA Technical Reports Server (NTRS)

Katz, Daniel S.

2006-01-01

The Multithreaded Microbenchmarks for Processor-In-Memory (PIM) Compilers, Simulators, and Hardware are computer programs arranged in a series for use in testing the performances of PIM computing systems, including compilers, simulators, and hardware. The programs at the beginning of the series test basic functionality; the programs at subsequent positions in the series test increasingly complex functionality. The programs are intended to be used while designing a PIM system, and can be used to verify that compilers, simulators, and hardware work correctly. The programs can also be used to enable designers of these system components to examine tradeoffs in implementation. Finally, these programs can be run on non-PIM hardware (either single-threaded or multithreaded) using the POSIX pthreads standard to verify that the benchmarks themselves operate correctly. [POSIX (Portable Operating System Interface for UNIX) is a set of standards that define how programs and operating systems interact with each other. pthreads is a library of pre-emptive thread routines that comply with one of the POSIX standards.
Real time emotion aware applications: a case study employing emotion evocative pictures and neuro-physiological sensing enhanced by Graphic Processor Units.

PubMed

Konstantinidis, Evdokimos I; Frantzidis, Christos A; Pappas, Costas; Bamidis, Panagiotis D

2012-07-01

In this paper the feasibility of adopting Graphic Processor Units towards real-time emotion aware computing is investigated for boosting the time consuming computations employed in such applications. The proposed methodology was employed in analysis of encephalographic and electrodermal data gathered when participants passively viewed emotional evocative stimuli. The GPU effectiveness when processing electroencephalographic and electrodermal recordings is demonstrated by comparing the execution time of chaos/complexity analysis through nonlinear dynamics (multi-channel correlation dimension/D2) and signal processing algorithms (computation of skin conductance level/SCL) into various popular programming environments. Apart from the beneficial role of parallel programming, the adoption of special design techniques regarding memory management may further enhance the time minimization which approximates a factor of 30 in comparison with ANSI C language (single-core sequential execution). Therefore, the use of GPU parallel capabilities offers a reliable and robust solution for real-time sensing the user's affective state. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
High performance in silico virtual drug screening on many-core processors.

PubMed

McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-05-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
High performance in silico virtual drug screening on many-core processors

PubMed Central

Price, James; Sessions, Richard B; Ibarra, Amaurys A

2015-01-01

Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727

Digital Waveguide Architectures for Virtual Musical Instruments

NASA Astrophysics Data System (ADS)

Smith, Julius O.

Digital sound synthesis has become a standard staple of modern music studios, videogames, personal computers, and hand-held devices. As processing power has increased over the years, sound synthesis implementations have evolved from dedicated chip sets, to single-chip solutions, and ultimately to software implementations within processors used primarily for other tasks (such as for graphics or general purpose computing). With the cost of implementation dropping closer and closer to zero, there is increasing room for higher quality algorithms.
System balance analysis for vector computers

NASA Technical Reports Server (NTRS)

Knight, J. C.; Poole, W. G., Jr.; Voight, R. G.

1975-01-01

The availability of vector processors capable of sustaining computing rates of 10 to the 8th power arithmetic results pers second raised the question of whether peripheral storage devices representing current technology can keep such processors supplied with data. By examining the solution of a large banded linear system on these computers, it was found that even under ideal conditions, the processors will frequently be waiting for problem data.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Krityakierne, Tipaluck; Akhtar, Taimoor; Shoemaker, Christine A.

This paper presents a parallel surrogate-based global optimization method for computationally expensive objective functions that is more effective for larger numbers of processors. To reach this goal, we integrated concepts from multi-objective optimization and tabu search into, single objective, surrogate optimization. Our proposed derivative-free algorithm, called SOP, uses non-dominated sorting of points for which the expensive function has been previously evaluated. The two objectives are the expensive function value of the point and the minimum distance of the point to previously evaluated points. Based on the results of non-dominated sorting, P points from the sorted fronts are selected as centersmore » from which many candidate points are generated by random perturbations. Based on surrogate approximation, the best candidate point is subsequently selected for expensive evaluation for each of the P centers, with simultaneous computation on P processors. Centers that previously did not generate good solutions are tabu with a given tenure. We show almost sure convergence of this algorithm under some conditions. The performance of SOP is compared with two RBF based methods. The test results show that SOP is an efficient method that can reduce time required to find a good near optimal solution. In a number of cases the efficiency of SOP is so good that SOP with 8 processors found an accurate answer in less wall-clock time than the other algorithms did with 32 processors.« less
Solving Coupled Gross--Pitaevskii Equations on a Cluster of PlayStation 3 Computers

NASA Astrophysics Data System (ADS)

Edwards, Mark; Heward, Jeffrey; Clark, C. W.

2009-05-01

At Georgia Southern University we have constructed an 8+1--node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra--cold atoms in general with a particular emphasis on studying bose--bose and bose--fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time--dependent, one--dimensional Gross--Pitaevskii (TDGP) equations. These equations govern the behavior of dual-- species bosonic mixtures. We chose the split--operator/FFT to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 cpu known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64--bit PowerPC Processor Element known as the PPE and eight ``Synergistic Processor Element'' identified as the SPE's. We report on this algorithm and compare its performance to a non--parallel solver as applied to modeling evaporative cooling in dual--species bosonic mixtures.
Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors

NASA Astrophysics Data System (ADS)

Khajeh-Saeed, Ali; Poole, Stephen; Blair Perot, J.

2010-06-01

Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. Like many applications in computational science, the Smith-Waterman algorithm is constrained by the memory access speed and can be accelerated significantly by using graphics processors (GPUs) as the compute engine. In this work we show that effective use of the GPU requires a novel reformulation of the Smith-Waterman algorithm. The performance of this new version of the algorithm is demonstrated using the SSCA#1 (Bioinformatics) benchmark running on one GPU and on up to four GPUs executing in parallel. The results indicate that for large problems a single GPU is up to 45 times faster than a CPU for this application, and the parallel implementation shows linear speed up on up to 4 GPUs.
Single-Scale Retinex Using Digital Signal Processors

NASA Technical Reports Server (NTRS)

Hines, Glenn; Rahman, Zia-Ur; Jobson, Daniel; Woodell, Glenn

2005-01-01

The Retinex is an image enhancement algorithm that improves the brightness, contrast and sharpness of an image. It performs a non-linear spatial/spectral transform that provides simultaneous dynamic range compression and color constancy. It has been used for a wide variety of applications ranging from aviation safety to general purpose photography. Many potential applications require the use of Retinex processing at video frame rates. This is difficult to achieve with general purpose processors because the algorithm contains a large number of complex computations and data transfers. In addition, many of these applications also constrain the potential architectures to embedded processors to save power, weight and cost. Thus we have focused on digital signal processors (DSPs) and field programmable gate arrays (FPGAs) as potential solutions for real-time Retinex processing. In previous efforts we attained a 21 (full) frame per second (fps) processing rate for the single-scale monochromatic Retinex with a TMS320C6711 DSP operating at 150 MHz. This was achieved after several significant code improvements and optimizations. Since then we have migrated our design to the slightly more powerful TMS320C6713 DSP and the fixed point TMS320DM642 DSP. In this paper we briefly discuss the Retinex algorithm, the performance of the algorithm executing on the TMS320C6713 and the TMS320DM642, and compare the results with the TMS320C6711.
Advanced Multiple Processor Configuration Study. Final Report.

ERIC Educational Resources Information Center

Clymer, S. J.

This summary of a study on multiple processor configurations includes the objectives, background, approach, and results of research undertaken to provide the Air Force with a generalized model of computer processor combinations for use in the evaluation of proposed flight training simulator computational designs. An analysis of a real-time flight…
System support software for the Space Ultrareliable Modular Computer (SUMC)

NASA Technical Reports Server (NTRS)

Hill, T. E.; Hintze, G. C.; Hodges, B. C.; Austin, F. A.; Buckles, B. P.; Curran, R. T.; Lackey, J. D.; Payne, R. E.

1974-01-01

The highly transportable programming system designed and implemented to support the development of software for the Space Ultrareliable Modular Computer (SUMC) is described. The SUMC system support software consists of program modules called processors. The initial set of processors consists of the supervisor, the general purpose assembler for SUMC instruction and microcode input, linkage editors, an instruction level simulator, a microcode grid print processor, and user oriented utility programs. A FORTRAN 4 compiler is undergoing development. The design facilitates the addition of new processors with a minimum effort and provides the user quasi host independence on the ground based operational software development computer. Additional capability is provided to accommodate variations in the SUMC architecture without consequent major modifications in the initial processors.
The parallel algorithm for the 2D discrete wavelet transform

NASA Astrophysics Data System (ADS)

Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel

2018-04-01

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Color sensor and neural processor on one chip

NASA Astrophysics Data System (ADS)

Fiesler, Emile; Campbell, Shannon R.; Kempem, Lother; Duong, Tuan A.

1998-10-01

Low-cost, compact, and robust color sensor that can operate in real-time under various environmental conditions can benefit many applications, including quality control, chemical sensing, food production, medical diagnostics, energy conservation, monitoring of hazardous waste, and recycling. Unfortunately, existing color sensor are either bulky and expensive or do not provide the required speed and accuracy. In this publication we describe the design of an accurate real-time color classification sensor, together with preprocessing and a subsequent neural network processor integrated on a single complementary metal oxide semiconductor (CMOS) integrated circuit. This one-chip sensor and information processor will be low in cost, robust, and mass-producible using standard commercial CMOS processes. The performance of the chip and the feasibility of its manufacturing is proven through computer simulations based on CMOS hardware parameters. Comparisons with competing methodologies show a significantly higher performance for our device.
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the rootmore » processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicated the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.« less
Bus-Programmable Slave Card

NASA Technical Reports Server (NTRS)

Hall, William A.

1990-01-01

Slave microprocessors in multimicroprocessor computing system contains modified circuit cards programmed via bus connecting master processor with slave microprocessors. Enables interactive, microprocessor-based, single-loop control. Confers ability to load and run program from master/slave bus, without need for microprocessor development station. Tristate buffers latch all data and information on status. Slave central processing unit never connected directly to bus.
Multiprocessor programming environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, M.B.; Fornaro, R.

Programming tools and techniques have been well developed for traditional uniprocessor computer systems. The focus of this research project is on the development of a programming environment for a high speed real time heterogeneous multiprocessor system, with special emphasis on languages and compilers. The new tools and techniques will allow a smooth transition for programmers with experience only on single processor systems.
Fault-tolerant computer study. [logic designs for building block circuits

NASA Technical Reports Server (NTRS)

Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.

1981-01-01

A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.
System and method for controlling power consumption in a computer system based on user satisfaction

DOEpatents

Yang, Lei; Dick, Robert P; Chen, Xi; Memik, Gokhan; Dinda, Peter A; Shy, Alex; Ozisikyilmaz, Berkin; Mallik, Arindam; Choudhary, Alok

2014-04-22

Systems and methods for controlling power consumption in a computer system. For each of a plurality of interactive applications, the method changes a frequency at which a processor of the computer system runs, receives an indication of user satisfaction, determines a relationship between the changed frequency and the user satisfaction of the interactive application, and stores the determined relationship information. The determined relationship can distinguish between different users and different interactive applications. A frequency may be selected from the discrete frequencies at which the processor of the computer system runs based on the determined relationship information for a particular user and a particular interactive application running on the processor of the computer system. The processor may be adapted to run at the selected frequency.
Pausing and activating thread state upon pin assertion by external logic monitoring polling loop exit time condition

DOEpatents

Chen, Dong; Giampapa, Mark; Heidelberger, Philip; Ohmacht, Martin; Satterfield, David L; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan

2013-05-21

A system and method for enhancing performance of a computer which includes a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program are executed by a processer. The processor processes instructions from the program. A wait state in the processor waits for receiving specified data. A thread in the processor has a pause state wherein the processor waits for specified data. A pin in the processor initiates a return to an active state from the pause state for the thread. A logic circuit is external to the processor, and the logic circuit is configured to detect a specified condition. The pin initiates a return to the active state of the thread when the specified condition is detected using the logic circuit.
Implementation of MPEG-2 encoder to multiprocessor system using multiple MVPs (TMS320C80)

NASA Astrophysics Data System (ADS)

Kim, HyungSun; Boo, Kenny; Chung, SeokWoo; Choi, Geon Y.; Lee, YongJin; Jeon, JaeHo; Park, Hyun Wook

1997-05-01

This paper presents the efficient algorithm mapping for the real-time MPEG-2 encoding on the KAIST image computing system (KICS), which has a parallel architecture using five multimedia video processors (MVPs). The MVP is a general purpose digital signal processor (DSP) of Texas Instrument. It combines one floating-point processor and four fixed- point DSPs on a single chip. The KICS uses the MVP as a primary processing element (PE). Two PEs form a cluster, and there are two processing clusters in the KICS. Real-time MPEG-2 encoder is implemented through the spatial and the functional partitioning strategies. Encoding process of spatially partitioned half of the video input frame is assigned to ne processing cluster. Two PEs perform the functionally partitioned MPEG-2 encoding tasks in the pipelined operation mode. One PE of a cluster carries out the transform coding part and the other performs the predictive coding part of the MPEG-2 encoding algorithm. One MVP among five MVPs is used for system control and interface with host computer. This paper introduces an implementation of the MPEG-2 algorithm with a parallel processing architecture.
20-GFLOPS QR processor on a Xilinx Virtex-E FPGA

NASA Astrophysics Data System (ADS)

Walke, Richard L.; Smith, Robert W. M.; Lightbody, Gaye

2000-11-01

Adaptive beamforming can play an important role in sensor array systems in countering directional interference. In high-sample rate systems, such as radar and comms, the calculation of adaptive weights is a very computational task that requires highly parallel solutions. For systems where low power consumption and volume are important the only viable implementation is as an Application Specific Integrated Circuit (ASIC). However, the rapid advancement of Field Programmable Gate Array (FPGA) technology is enabling highly credible re-programmable solutions. In this paper we present the implementation of a scalable linear array processor for weight calculation using QR decomposition. We employ floating-point arithmetic with mantissa size optimized to the target application to minimize component size, and implement them as relationally placed macros (RPMs) on Xilinx Virtex FPGAs to achieve predictable dense layout and high-speed operation. We present results that show that 20GFLOPS of sustained computation on a single XCV3200E-8 Virtex-E FPGA is possible. We also describe the parameterized implementation of the floating-point operators and QR-processor, and the design methodology that enables us to rapidly generate complex FPGA implementations using the industry standard hardware description language VHDL.
Managing Power Heterogeneity

NASA Astrophysics Data System (ADS)

Pruhs, Kirk

A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous, are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core was required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.
Multibus-based parallel processor for simulation

NASA Technical Reports Server (NTRS)

Ogrady, E. P.; Wang, C.-H.

1983-01-01

A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.

Application of NASA General-Purpose Solver to Large-Scale Computations in Aeroacoustics

NASA Technical Reports Server (NTRS)

Watson, Willie R.; Storaasli, Olaf O.

2004-01-01

Of several iterative and direct equation solvers evaluated previously for computations in aeroacoustics, the most promising was the NASA-developed General-Purpose Solver (winner of NASA's 1999 software of the year award). This paper presents detailed, single-processor statistics of the performance of this solver, which has been tailored and optimized for large-scale aeroacoustic computations. The statistics, compiled using an SGI ORIGIN 2000 computer with 12 Gb available memory (RAM) and eight available processors, are the central processing unit time, RAM requirements, and solution error. The equation solver is capable of solving 10 thousand complex unknowns in as little as 0.01 sec using 0.02 Gb RAM, and 8.4 million complex unknowns in slightly less than 3 hours using all 12 Gb. This latter solution is the largest aeroacoustics problem solved to date with this technique. The study was unable to detect any noticeable error in the solution, since noise levels predicted from these solution vectors are in excellent agreement with the noise levels computed from the exact solution. The equation solver provides a means for obtaining numerical solutions to aeroacoustics problems in three dimensions.
Special purpose parallel computer architecture for real-time control and simulation in robotic applications

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1993-01-01

This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Multiple core computer processor with globally-accessible local memories

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shalf, John; Donofrio, David; Oliker, Leonid

A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality ofmore » processor cores.« less
Scalable load balancing for massively parallel distributed Monte Carlo particle transport

DOE Office of Scientific and Technical Information (OSTI.GOV)

O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

2013-07-01

In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time 0(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrencemore » Livermore National Laboratory. (authors)« less
General purpose molecular dynamics simulations fully implemented on graphics processing units

NASA Astrophysics Data System (ADS)

Anderson, Joshua A.; Lorenz, Chris D.; Travesset, A.

2008-05-01

Graphics processing units (GPUs), originally developed for rendering real-time effects in computer games, now provide unprecedented computational power for scientific applications. In this paper, we develop a general purpose molecular dynamics code that runs entirely on a single GPU. It is shown that our GPU implementation provides a performance equivalent to that of fast 30 processor core distributed memory cluster. Our results show that GPUs already provide an inexpensive alternative to such clusters and discuss implications for the future.
Ultra-Reliable Digital Avionics (URDA) processor

NASA Astrophysics Data System (ADS)

Branstetter, Reagan; Ruszczyk, William; Miville, Frank

1994-10-01

Texas Instruments Incorporated (TI) developed the URDA processor design under contract with the U.S. Air Force Wright Laboratory and the U.S. Army Night Vision and Electro-Sensors Directorate. TI's approach couples advanced packaging solutions with advanced integrated circuit (IC) technology to provide a high-performance (200 MIPS/800 MFLOPS) modular avionics processor module for a wide range of avionics applications. TI's processor design integrates two Ada-programmable, URDA basic processor modules (BPM's) with a JIAWG-compatible PiBus and TMBus on a single F-22 common integrated processor-compatible form-factor SEM-E avionics card. A separate, high-speed (25-MWord/second 32-bit word) input/output bus is provided for sensor data. Each BPM provides a peak throughput of 100 MIPS scalar concurrent with 400-MFLOPS vector processing in a removable multichip module (MCM) mounted to a liquid-flowthrough (LFT) core and interfacing to a processor interface module printed wiring board (PWB). Commercial RISC technology coupled with TI's advanced bipolar complementary metal oxide semiconductor (BiCMOS) application specific integrated circuit (ASIC) and silicon-on-silicon packaging technologies are used to achieve the high performance in a miniaturized package. A Mips R4000-family reduced instruction set computer (RISC) processor and a TI 100-MHz BiCMOS vector coprocessor (VCP) ASIC provide, respectively, the 100 MIPS of a scalar processor throughput and 400 MFLOPS of vector processing throughput for each BPM. The TI Aladdim ASIC chipset was developed on the TI Aladdin Program under contract with the U.S. Army Communications and Electronics Command and was sponsored by the Advanced Research Projects Agency with technical direction from the U.S. Army Night Vision and Electro-Sensors Directorate.
Embedded Data Processor and Portable Computer Technology testbeds

NASA Technical Reports Server (NTRS)

Alena, Richard; Liu, Yuan-Kwei; Goforth, Andre; Fernquist, Alan R.

1993-01-01

Attention is given to current activities in the Embedded Data Processor and Portable Computer Technology testbed configurations that are part of the Advanced Data Systems Architectures Testbed at the Information Sciences Division at NASA Ames Research Center. The Embedded Data Processor Testbed evaluates advanced microprocessors for potential use in mission and payload applications within the Space Station Freedom Program. The Portable Computer Technology (PCT) Testbed integrates and demonstrates advanced portable computing devices and data system architectures. The PCT Testbed uses both commercial and custom-developed devices to demonstrate the feasibility of functional expansion and networking for portable computers in flight missions.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

NASA Technical Reports Server (NTRS)

Smith, T. B., Jr.; Lala, J. H.

1983-01-01

The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

PubMed Central

2010-01-01

Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics. PMID:20064262
A programmable two-qubit quantum processor in silicon

NASA Astrophysics Data System (ADS)

Watson, T. F.; Philips, S. G. J.; Kawakami, E.; Ward, D. R.; Scarlino, P.; Veldhorst, M.; Savage, D. E.; Lagally, M. G.; Friesen, Mark; Coppersmith, S. N.; Eriksson, M. A.; Vandersypen, L. M. K.

2018-03-01

Now that it is possible to achieve measurement and control fidelities for individual quantum bits (qubits) above the threshold for fault tolerance, attention is moving towards the difficult task of scaling up the number of physical qubits to the large numbers that are needed for fault-tolerant quantum computing. In this context, quantum-dot-based spin qubits could have substantial advantages over other types of qubit owing to their potential for all-electrical operation and ability to be integrated at high density onto an industrial platform. Initialization, readout and single- and two-qubit gates have been demonstrated in various quantum-dot-based qubit representations. However, as seen with small-scale demonstrations of quantum computers using other types of qubit, combining these elements leads to challenges related to qubit crosstalk, state leakage, calibration and control hardware. Here we overcome these challenges by using carefully designed control techniques to demonstrate a programmable two-qubit quantum processor in a silicon device that can perform the Deutsch–Josza algorithm and the Grover search algorithm—canonical examples of quantum algorithms that outperform their classical analogues. We characterize the entanglement in our processor by using quantum-state tomography of Bell states, measuring state fidelities of 85–89 per cent and concurrences of 73–82 per cent. These results pave the way for larger-scale quantum computers that use spins confined to quantum dots.
A programmable two-qubit quantum processor in silicon.

PubMed

Watson, T F; Philips, S G J; Kawakami, E; Ward, D R; Scarlino, P; Veldhorst, M; Savage, D E; Lagally, M G; Friesen, Mark; Coppersmith, S N; Eriksson, M A; Vandersypen, L M K

2018-03-29

Now that it is possible to achieve measurement and control fidelities for individual quantum bits (qubits) above the threshold for fault tolerance, attention is moving towards the difficult task of scaling up the number of physical qubits to the large numbers that are needed for fault-tolerant quantum computing. In this context, quantum-dot-based spin qubits could have substantial advantages over other types of qubit owing to their potential for all-electrical operation and ability to be integrated at high density onto an industrial platform. Initialization, readout and single- and two-qubit gates have been demonstrated in various quantum-dot-based qubit representations. However, as seen with small-scale demonstrations of quantum computers using other types of qubit, combining these elements leads to challenges related to qubit crosstalk, state leakage, calibration and control hardware. Here we overcome these challenges by using carefully designed control techniques to demonstrate a programmable two-qubit quantum processor in a silicon device that can perform the Deutsch-Josza algorithm and the Grover search algorithm-canonical examples of quantum algorithms that outperform their classical analogues. We characterize the entanglement in our processor by using quantum-state tomography of Bell states, measuring state fidelities of 85-89 per cent and concurrences of 73-82 per cent. These results pave the way for larger-scale quantum computers that use spins confined to quantum dots.
Holo-Chidi video concentrator card

NASA Astrophysics Data System (ADS)

Nwodoh, Thomas A.; Prabhakar, Aditya; Benton, Stephen A.

2001-12-01

The Holo-Chidi Video Concentrator Card is a frame buffer for the Holo-Chidi holographic video processing system. Holo- Chidi is designed at the MIT Media Laboratory for real-time computation of computer generated holograms and the subsequent display of the holograms at video frame rates. The Holo-Chidi system is made of two sets of cards - the set of Processor cards and the set of Video Concentrator Cards (VCCs). The Processor cards are used for hologram computation, data archival/retrieval from a host system, and for higher-level control of the VCCs. The VCC formats computed holographic data from multiple hologram computing Processor cards, converting the digital data to analog form to feed the acousto-optic-modulators of the Media lab's Mark-II holographic display system. The Video Concentrator card is made of: a High-Speed I/O (HSIO) interface whence data is transferred from the hologram computing Processor cards, a set of FIFOs and video RAM used as buffer for data for the hololines being displayed, a one-chip integrated microprocessor and peripheral combination that handles communication with other VCCs and furnishes the card with a USB port, a co-processor which controls display data formatting, and D-to-A converters that convert digital fringes to analog form. The co-processor is implemented with an SRAM-based FPGA with over 500,000 gates and controls all the signals needed to format the data from the multiple Processor cards into the format required by Mark-II. A VCC has three HSIO ports through which up to 500 Megabytes of computed holographic data can flow from the Processor Cards to the VCC per second. A Holo-Chidi system with three VCCs has enough frame buffering capacity to hold up to thirty two 36Megabyte hologram frames at a time. Pre-computed holograms may also be loaded into the VCC from a host computer through the low- speed USB port. Both the microprocessor and the co- processor in the VCC can access the main system memory used to store control programs and data for the VCC. The Card also generates the control signals used by the scanning mirrors of Mark-II. In this paper we discuss the design of the VCC and its implementation in the Holo-Chidi system.
Simulation of a master-slave event set processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Comfort, J.C.

1984-03-01

Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with tracemore » output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.« less
Implementing direct, spatially isolated problems on transputer networks

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

Parametric studies were performed on transputer networks of up to 40 processors to determine how to implement and maximize the performance of the solution of problems where no processor-to-processor data transfer is required for the problem solution (spatially isolated). Two types of problems are investigated a computationally intensive problem where the solution required the transmission of 160 bytes of data through the parallel network, and a communication intensive example that required the transmission of 3 Mbytes of data through the network. This data consists of solutions being sent back to the host processor and not intermediate results for another processor to work on. Studies were performed on both integer and floating-point transputers. The latter features an on-chip floating-point math unit and offers approximately an order of magnitude performance increase over the integer transputer on real valued computations. The results indicate that a minimum amount of work is required on each node per communication to achieve high network speedups (efficiencies). The floating-point processor requires approximately an order of magnitude more work per communication than the integer processor because of the floating-point unit's increased computing capacity.
Master/Programmable-Slave Computer

NASA Technical Reports Server (NTRS)

Smaistrla, David; Hall, William A.

1990-01-01

Unique modular computer features compactness, low power, mass storage of data, multiprocessing, and choice of various input/output modes. Master processor communicates with user via usual keyboard and video display terminal. Coordinates operations of as many as 24 slave processors, each dedicated to different experiment. Each slave circuit card includes slave microprocessor and assortment of input/output circuits for communication with external equipment, with master processor, and with other slave processors. Adaptable to industrial process control with selectable degrees of automatic control, automatic and/or manual monitoring, and manual intervention.
SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processor in the form of a hybrid microcircuit

NASA Astrophysics Data System (ADS)

Evtikhiev, N. N.; Esepkina, N. A.; Dolgii, V. A.; Lavrov, A. P.; Khotyanov, B. M.; Chernokozhin, V. V.; Shestak, S. A.

1995-10-01

An optoelectronic processor in the form of a hybrid microcircuit is described. An analysis is made of the feasibility of developing a new class of optoelectronic processors which are hybrid microcircuits and can operate both as self-contained specialised computers and also as functional components of computing systems.
A Simple and Affordable TTL Processor for the Classroom

ERIC Educational Resources Information Center

Feinberg, Dave

2007-01-01

This paper presents a simple 4 bit computer processor design that may be built using TTL chips for less than $65. In addition to describing the processor itself in detail, we discuss our experience using the laboratory kit and its associated machine instruction set to teach computer architecture to high school students. (Contains 3 figures and 5…
Input data requirements for special processors in the computation system containing the VENTURE neutronics code. [LMFBR

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vondy, D.R.; Fowler, T.B.; Cunningham, G.W.

1979-07-01

User input data requirements are presented for certain special processors in a nuclear reactor computation system. These processors generally read data in formatted form and generate binary interface data files. Some data processing is done to convert from the user oriented form to the interface file forms. The VENTURE diffusion theory neutronics code and other computation modules in this system use the interface data files which are generated.
High Resolution Imaging Testbed Utilizing Sodium Laser Guide Star Adaptive Optics: The Real Time Wavefront Reconstructor Computer

DTIC Science & Technology

2008-07-31

Unlike the Lyrtech, each DSP on a Bittware board offers 3 MB of on-chip memory and 3 GFLOPs of 32-bit peak processing power. Based on the performance...Each NVIDIA 8800 Ultra features 576 GFLOPS on 128 612-MHz single-precision floating-point SIMD processors, arranged in 16 clusters of eight. Each
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU

NASA Astrophysics Data System (ADS)

Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.

2016-09-01

The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range hydrocarbon materials. It is also extensible to other atom types and interactions. REBO potential assumes complex multi-body interaction model, that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of molecular dynamics algorithm using REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronizations issues that arise in computations of multi-body potential. Techniques developed for this problem may be also used to achieve efficient solutions of different problems. The performance of proposed algorithm is assessed using a range of model systems. It is compared to highly optimized CPU implementation (both single core and OpenMP) available in LAMMPS package. These experiments show up to 6x improvement in forces computation time using single processor of the NVIDIA Tesla K80 compared to high end 16-core Intel Xeon processor.

Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Attaway, S.; Brown, K.; Gardner, D.

1997-12-31

Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia`s transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for differentmore » key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten- million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.« less
Megatux

DOE Office of Scientific and Technical Information (OSTI.GOV)

2012-09-25

The Megatux platform enables the emulation of large scale (multi-million node) distributed systems. In particular, it allows for the emulation of large-scale networks interconnecting a very large number of emulated computer systems. It does this by leveraging virtualization and associated technologies to allow hundreds of virtual computers to be hosted on a single moderately sized server or workstation. Virtualization technology provided by modern processors allows for multiple guest OSs to run at the same time, sharing the hardware resources. The Megatux platform can be deployed on a single PC, a small cluster of a few boxes or a large clustermore » of computers. With a modest cluster, the Megatux platform can emulate complex organizational networks. By using virtualization, we emulate the hardware, but run actual software enabling large scale without sacrificing fidelity.« less
A high performance linear equation solver on the VPP500 parallel supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi

1994-12-31

This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of VPP500--(1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LIN-PACK Highly Parallel Computing benchmark.
Symplectic multi-particle tracking on GPUs

NASA Astrophysics Data System (ADS)

Liu, Zhicong; Qiang, Ji

2018-05-01

A symplectic multi-particle tracking model is implemented on the Graphic Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) language. The symplectic tracking model can preserve phase space structure and reduce non-physical effects in long term simulation, which is important for beam property evaluation in particle accelerators. Though this model is computationally expensive, it is very suitable for parallelization and can be accelerated significantly by using GPUs. In this paper, we optimized the implementation of the symplectic tracking model on both single GPU and multiple GPUs. Using a single GPU processor, the code achieves a factor of 2-10 speedup for a range of problem sizes compared with the time on a single state-of-the-art Central Processing Unit (CPU) node with similar power consumption and semiconductor technology. It also shows good scalability on a multi-GPU cluster at Oak Ridge Leadership Computing Facility. In an application to beam dynamics simulation, the GPU implementation helps save more than a factor of two total computing time in comparison to the CPU implementation.
75 FR 69484 - Self-Regulatory Organizations; BATS Exchange, Inc.; NASDAQ OMX BX, Inc.; Chicago Board Options...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-11-12

... processor. In addition, the Amendment modifies the proposals so that a market maker's quoting obligations... reported by the responsible single plan processor. Finally, so that the markets may coordinate..., as reported by the responsible single plan processor. The Amendment also modifies that the market...
Effective Vectorization with OpenMP 4.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor s potential performance that, if SIMDmore » is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.« less
Implementation and analysis of a Navier-Stokes algorithm on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1988-01-01

The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
A Course on Reconfigurable Processors

ERIC Educational Resources Information Center

Shoufan, Abdulhadi; Huss, Sorin A.

2010-01-01

Reconfigurable computing is an established field in computer science. Teaching this field to computer science students demands special attention due to limited student experience in electronics and digital system design. This article presents a compact course on reconfigurable processors, which was offered at the Technische Universitat Darmstadt,…
Effective correlator for RadioAstron project

NASA Astrophysics Data System (ADS)

Sergeev, Sergey

This paper presents the implementation of programme FX-correlator for Very Long Baseline Interferometry, adapted for the project "RadioAstron". Software correlator implemented for heterogeneous computing systems using graphics accelerators. It is shown that for the task interferometry implementation of the graphics hardware has a high efficiency. The host processor of heterogeneous computing system, performs the function of forming the data flow for graphics accelerators, the number of which corresponds to the number of frequency channels. So, for the Radioastron project, such channels is seven. Each accelerator is perform correlation matrix for all bases for a single frequency channel. Initial data is converted to the floating-point format, is correction for the corresponding delay function and computes the entire correlation matrix simultaneously. Calculation of the correlation matrix is performed using the sliding Fourier transform. Thus, thanks to the compliance of a solved problem for architecture graphics accelerators, managed to get a performance for one processor platform Kepler, which corresponds to the performance of this task, the computing cluster platforms Intel on four nodes. This task successfully scaled not only on a large number of graphics accelerators, but also on a large number of nodes with multiple accelerators.
On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods

PubMed Central

Lee, Anthony; Yau, Christopher; Giles, Michael B.; Doucet, Arnaud; Holmes, Christopher C.

2011-01-01

We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we nd speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. PMID:22003276
Preliminary study on the potential usefulness of array processor techniques for structural synthesis

NASA Technical Reports Server (NTRS)

Feeser, L. J.

1980-01-01

The effects of the use of array processor techniques within the structural analyzer program, SPAR, are simulated in order to evaluate the potential analysis speedups which may result. In particular the connection of a Floating Point System AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data estimates are made as to the relative speedups that can be executed in a more complete implementation on an array processor maxi-mini computer system.
Solving very large, sparse linear systems on mesh-connected parallel computers

NASA Technical Reports Server (NTRS)

Opsahl, Torstein; Reif, John

1987-01-01

The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
Parallel, Asynchronous Executive (PAX): System concepts, facilities, and architecture

NASA Technical Reports Server (NTRS)

Jones, W. H.

1983-01-01

The Parallel, Asynchronous Executive (PAX) is a software operating system simulation that allows many computers to work on a single problem at the same time. PAX is currently implemented on a UNIVAC 1100/42 computer system. Independent UNIVAC runstreams are used to simulate independent computers. Data are shared among independent UNIVAC runstreams through shared mass-storage files. PAX has achieved the following: (1) applied several computing processes simultaneously to a single, logically unified problem; (2) resolved most parallel processor conflicts by careful work assignment; (3) resolved by means of worker requests to PAX all conflicts not resolved by work assignment; (4) provided fault isolation and recovery mechanisms to meet the problems of an actual parallel, asynchronous processing machine. Additionally, one real-life problem has been constructed for the PAX environment. This is CASPER, a collection of aerodynamic and structural dynamic problem simulation routines. CASPER is not discussed in this report except to provide examples of parallel-processing techniques.
Static Schedulers for Embedded Real-Time Systems

DTIC Science & Technology

1989-12-01

Because of the need for having efficient scheduling algorithms in large scale real time systems , software engineers put a lot of effort on developing...provide static schedulers for he Embedded Real Time Systems with single processor using Ada programming language. The independent nonpreemptable...support the Computer Aided Rapid Prototyping for Embedded Real Time Systems so that we determine whether the system, as designed, meets the required
CMOS VLSI Layout and Verification of a SIMD Computer

NASA Technical Reports Server (NTRS)

Zheng, Jianqing

1996-01-01

A CMOS VLSI layout and verification of a 3 x 3 processor parallel computer has been completed. The layout was done using the MAGIC tool and the verification using HSPICE. Suggestions for expanding the computer into a million processor network are presented. Many problems that might be encountered when implementing a massively parallel computer are discussed.
Scalable Multiprocessor for High-Speed Computing in Space

NASA Technical Reports Server (NTRS)

Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard

2004-01-01

A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.
Replication of Space-Shuttle Computers in FPGAs and ASICs

NASA Technical Reports Server (NTRS)

Ferguson, Roscoe C.

2008-01-01

A document discusses the replication of the functionality of the onboard space-shuttle general-purpose computers (GPCs) in field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). The purpose of the replication effort is to enable utilization of proven space-shuttle flight software and software-development facilities to the extent possible during development of software for flight computers for a new generation of launch vehicles derived from the space shuttles. The replication involves specifying the instruction set of the central processing unit and the input/output processor (IOP) of the space-shuttle GPC in a hardware description language (HDL). The HDL is synthesized to form a "core" processor in an FPGA or, less preferably, in an ASIC. The core processor can be used to create a flight-control card to be inserted into a new avionics computer. The IOP of the GPC as implemented in the core processor could be designed to support data-bus protocols other than that of a multiplexer interface adapter (MIA) used in the space shuttle. Hence, a computer containing the core processor could be tailored to communicate via the space-shuttle GPC bus and/or one or more other buses.
Operation of commercial R3000 processors in the low earth orbit (LEO) space environment

NASA Astrophysics Data System (ADS)

Kaschmitter, J. L.; Shaeffer, D. L.; Colella, N. J.; McKnett, C. L.; Coakley, P. G.

1991-12-01

Spacecraft processors must operate with minimal degradation of performance in the LEO radiation environment, which includes the effects of total accumulated ionizing dose and single event phenomena (SEP) caused by protons and cosmic rays. Commercially available microprocessors can offer a number of advantages relative to radiation-hardened devices but are not normally designed to tolerate effects induced by the LEO environment. Extensive testing of the MIPS R3000 Reduced Instruction Set Computer (RISC) microprocessor family for operation in LEO environments is reported. The authors have characterized total dose and SEP effects for altitudes and inclinations of interest to systems operating in LEO, and they postulate techniques for detection and alleviation of SEP effects based on experimental results.
JETSPIN: A specific-purpose open-source software for simulations of nanofiber electrospinning

NASA Astrophysics Data System (ADS)

Lauricella, Marco; Pontrelli, Giuseppe; Coluzza, Ivan; Pisignano, Dario; Succi, Sauro

2015-12-01

We present the open-source computer program JETSPIN, specifically designed to simulate the electrospinning process of nanofibers. Its capabilities are shown with proper reference to the underlying model, as well as a description of the relevant input variables and associated test-case simulations. The various interactions included in the electrospinning model implemented in JETSPIN are discussed in detail. The code is designed to exploit different computational architectures, from single to parallel processor workstations. This paper provides an overview of JETSPIN, focusing primarily on its structure, parallel implementations, functionality, performance, and availability.
Algorithms and Libraries

NASA Technical Reports Server (NTRS)

Dongarra, Jack

1998-01-01

This exploratory study initiated our inquiry into algorithms and applications that would benefit by latency tolerant approach to algorithm building, including the construction of new algorithms where appropriate. In a multithreaded execution, when a processor reaches a point where remote memory access is necessary, the request is sent out on the network and a context--switch occurs to a new thread of computation. This effectively masks a long and unpredictable latency due to remote loads, thereby providing tolerance to remote access latency. We began to develop standards to profile various algorithm and application parameters, such as the degree of parallelism, granularity, precision, instruction set mix, interprocessor communication, latency etc. These tools will continue to develop and evolve as the Information Power Grid environment matures. To provide a richer context for this research, the project also focused on issues of fault-tolerance and computation migration of numerical algorithms and software. During the initial phase we tried to increase our understanding of the bottlenecks in single processor performance. Our work began by developing an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. Based on the results we achieved in this study we are planning to study other architectures of interest, including development of cost models, and developing code generators appropriate to these architectures.

A Real-Time Marker-Based Visual Sensor Based on a FPGA and a Soft Core Processor

PubMed Central

Tayara, Hilal; Ham, Woonchul; Chong, Kil To

2016-01-01

This paper introduces a real-time marker-based visual sensor architecture for mobile robot localization and navigation. A hardware acceleration architecture for post video processing system was implemented on a field-programmable gate array (FPGA). The pose calculation algorithm was implemented in a System on Chip (SoC) with an Altera Nios II soft-core processor. For every frame, single pass image segmentation and Feature Accelerated Segment Test (FAST) corner detection were used for extracting the predefined markers with known geometries in FPGA. Coplanar PosIT algorithm was implemented on the Nios II soft-core processor supplied with floating point hardware for accelerating floating point operations. Trigonometric functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Inverse square root method has been implemented for approximating square root computations. Real time results have been achieved and pixel streams have been processed on the fly without any need to buffer the input frame for further implementation. PMID:27983714
A Real-Time Marker-Based Visual Sensor Based on a FPGA and a Soft Core Processor.

PubMed

Tayara, Hilal; Ham, Woonchul; Chong, Kil To

2016-12-15

This paper introduces a real-time marker-based visual sensor architecture for mobile robot localization and navigation. A hardware acceleration architecture for post video processing system was implemented on a field-programmable gate array (FPGA). The pose calculation algorithm was implemented in a System on Chip (SoC) with an Altera Nios II soft-core processor. For every frame, single pass image segmentation and Feature Accelerated Segment Test (FAST) corner detection were used for extracting the predefined markers with known geometries in FPGA. Coplanar PosIT algorithm was implemented on the Nios II soft-core processor supplied with floating point hardware for accelerating floating point operations. Trigonometric functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Inverse square root method has been implemented for approximating square root computations. Real time results have been achieved and pixel streams have been processed on the fly without any need to buffer the input frame for further implementation.
Fundamental physics issues of multilevel logic in developing a parallel processor.

NASA Astrophysics Data System (ADS)

Bandyopadhyay, Anirban; Miki, Kazushi

2007-06-01

In the last century, On and Off physical switches, were equated with two decisions 0 and 1 to express every information in terms of binary digits and physically realize it in terms of switches connected in a circuit. Apart from memory-density increase significantly, more possible choices in particular space enables pattern-logic a reality, and manipulation of pattern would allow controlling logic, generating a new kind of processor. Neumann's computer is based on sequential logic, processing bits one by one. But as pattern-logic is generated on a surface, viewing whole pattern at a time is a truly parallel processing. Following Neumann's and Shannons fundamental thermodynamical approaches we have built compatible model based on series of single molecule based multibit logic systems of 4-12 bits in an UHV-STM. On their monolayer multilevel communication and pattern formation is experimentally verified. Furthermore, the developed intelligent monolayer is trained by Artificial Neural Network. Therefore fundamental weak interactions for the building of truly parallel processor are explored here physically and theoretically.
Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

2012-01-01

Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
An efficient parallel-processing method for transposing large matrices in place.

PubMed

Portnoff, M R

1999-01-01

We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Gate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
Neuron splitting in compute-bound parallel network simulations enables runtime scaling with twice as many processors.

PubMed

Hines, Michael L; Eichner, Hubert; Schürmann, Felix

2008-08-01

Neuron tree topology equations can be split into two subtrees and solved on different processors with no change in accuracy, stability, or computational effort; communication costs involve only sending and receiving two double precision values by each subtree at each time step. Splitting cells is useful in attaining load balance in neural network simulations, especially when there is a wide range of cell sizes and the number of cells is about the same as the number of processors. For compute-bound simulations load balance results in almost ideal runtime scaling. Application of the cell splitting method to two published network models exhibits good runtime scaling on twice as many processors as could be effectively used with whole-cell balancing.
Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition

DOEpatents

Tomkins, James L [Albuquerque, NM; Camp, William J [Albuquerque, NM

2007-07-17

A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.
The computational structural mechanics testbed generic structural-element processor manual

NASA Technical Reports Server (NTRS)

Stanley, Gary M.; Nour-Omid, Shahram

1990-01-01

The usage and development of structural finite element processors based on the CSM Testbed's Generic Element Processor (GEP) template is documented. By convention, such processors have names of the form ESi, where i is an integer. This manual is therefore intended for both Testbed users who wish to invoke ES processors during the course of a structural analysis, and Testbed developers who wish to construct new element processors (or modify existing ones).
Microdot - A Four-Bit Microcontroller Designed for Distributed Low-End Computing in Satellites

NASA Astrophysics Data System (ADS)

2002-03-01

Many satellites are an integrated collection of sensors and actuators that require dedicated real-time control. For single processor systems, additional sensors require an increase in computing power and speed to provide the multi-tasking capability needed to service each sensor. Faster processors cost more and consume more power, which taxes a satellite's power resources and may lead to shorter satellite lifetimes. An alternative design approach is a distributed network of small and low power microcontrollers designed for space that handle the computing requirements of each individual sensor and actuator. The design of microdot, a four-bit microcontroller for distributed low-end computing, is presented. The design is based on previous research completed at the Space Electronics Branch, Air Force Research Laboratory (AFRL/VSSE) at Kirtland AFB, NM, and the Air Force Institute of Technology at Wright-Patterson AFB, OH. The Microdot has 29 instructions and a 1K x 4 instruction memory. The distributed computing architecture is based on the Philips Semiconductor I2C Serial Bus Protocol. A prototype was implemented and tested using an Altera Field Programmable Gate Array (FPGA). The prototype was operable to 9.1 MHz. The design was targeted for fabrication in a radiation-hardened-by-design gate-array cell library for the TSMC 0.35 micrometer CMOS process.
Stream Processors

NASA Astrophysics Data System (ADS)

Erez, Mattan; Dally, William J.

Stream processors, like other multi core architectures partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
Transient Finite Element Computations on a Variable Transputer System

NASA Technical Reports Server (NTRS)

Smolinski, Patrick J.; Lapczyk, Ireneusz

1993-01-01

A parallel program to analyze transient finite element problems was written and implemented on a system of transputer processors. The program uses the explicit time integration algorithm which eliminates the need for equation solving, making it more suitable for parallel computations. An interprocessor communication scheme was developed for arbitrary two dimensional grid processor configurations. Several 3-D problems were analyzed on a system with a small number of processors.
A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

DOE PAGES

Mittal, Sparsh; Vetter, Jeffrey S.

2015-04-24

Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. In this study, we provide a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based onmore » their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. Finally, we believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.« less
CPU architecture for a fast and energy-saving calculation of convolution neural networks

NASA Astrophysics Data System (ADS)

Knoll, Florian J.; Grelcke, Michael; Czymmek, Vitali; Holtorf, Tim; Hussmann, Stephan

2017-06-01

One of the most difficult problem in the use of artificial neural networks is the computational capacity. Although large search engine companies own specially developed hardware to provide the necessary computing power, for the conventional user only remains the state of the art method, which is the use of a graphic processing unit (GPU) as a computational basis. Although these processors are well suited for large matrix computations, they need massive energy. Therefore a new processor on the basis of a field programmable gate array (FPGA) has been developed and is optimized for the application of deep learning. This processor is presented in this paper. The processor can be adapted for a particular application (in this paper to an organic farming application). The power consumption is only a fraction of a GPU application and should therefore be well suited for energy-saving applications.
A Survey of Techniques for Modeling and Improving Reliability of Computing Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh; Vetter, Jeffrey S.

Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. In this study, we provide a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based onmore » their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. Finally, we believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.« less
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains.

PubMed

Jha, Ashwani; Flurchick, K M; Bikdash, Marwan; Kc, Dukka B

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10-15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors.
Parallel-SymD: A Parallel Approach to Detect Internal Symmetry in Protein Domains

PubMed Central

Jha, Ashwani; Flurchick, K. M.; Bikdash, Marwan

2016-01-01

Internally symmetric proteins are proteins that have a symmetrical structure in their monomeric single-chain form. Around 10–15% of the protein domains can be regarded as having some sort of internal symmetry. In this regard, we previously published SymD (symmetry detection), an algorithm that determines whether a given protein structure has internal symmetry by attempting to align the protein to its own copy after the copy is circularly permuted by all possible numbers of residues. SymD has proven to be a useful algorithm to detect symmetry. In this paper, we present a new parallelized algorithm called Parallel-SymD for detecting symmetry of proteins on clusters of computers. The achieved speedup of the new Parallel-SymD algorithm scales well with the number of computing processors. Scaling is better for proteins with a larger number of residues. For a protein of 509 residues, a speedup of 63 was achieved on a parallel system with 100 processors. PMID:27747230
Impacts of the IBM Cell Processor to Support Climate Models

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

2008-01-01

NASA is interested in the performance and cost benefits for adapting its applications to the IBM Cell processor. However, its 256KB local memory per SPE and the new communication mechanism, make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% computational time), (2) has a high computational load relative to transferring data from and to main memory, (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision, Fortran) to C and ported it with manually SIMDizing 4 independent columns and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, approximately 8.2X, approximately 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.
Platform-Independence and Scheduling In a Multi-Threaded Real-Time Simulation

NASA Technical Reports Server (NTRS)

Sugden, Paul P.; Rau, Melissa A.; Kenney, P. Sean

2001-01-01

Aviation research often relies on real-time, pilot-in-the-loop flight simulation as a means to develop new flight software, flight hardware, or pilot procedures. Often these simulations become so complex that a single processor is incapable of performing the necessary computations within a fixed time-step. Threads are an elegant means to distribute the computational work-load when running on a symmetric multi-processor machine. However, programming with threads often requires operating system specific calls that reduce code portability and maintainability. While a multi-threaded simulation allows a significant increase in the simulation complexity, it also increases the workload of a simulation operator by requiring that the operator determine which models run on which thread. To address these concerns an object-oriented design was implemented in the NASA Langley Standard Real-Time Simulation in C++ (LaSRS++) application framework. The design provides a portable and maintainable means to use threads and also provides a mechanism to automatically load balance the simulation models.
Bistatic scattering from a three-dimensional object above a two-dimensional randomly rough surface modeled with the parallel FDTD approach.

PubMed

Guo, L-X; Li, J; Zeng, H

2009-11-01

We present an investigation of the electromagnetic scattering from a three-dimensional (3-D) object above a two-dimensional (2-D) randomly rough surface. A Message Passing Interface-based parallel finite-difference time-domain (FDTD) approach is used, and the uniaxial perfectly matched layer (UPML) medium is adopted for truncation of the FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different number of processors is illustrated for one rough surface realization and shows that the computation time of our parallel FDTD algorithm is dramatically reduced relative to a single-processor implementation. Finally, the composite scattering coefficients versus scattered and azimuthal angle are presented and analyzed for different conditions, including the surface roughness, the dielectric constants, the polarization, and the size of the 3-D object.
Concurrent computation of attribute filters on shared memory parallel machines.

PubMed

Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

2008-10-01

Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.

Evaluation of reinitialization-free nonvolatile computer systems for energy-harvesting Internet of things applications

NASA Astrophysics Data System (ADS)

Onizawa, Naoya; Tamakoshi, Akira; Hanyu, Takahiro

2017-08-01

In this paper, reinitialization-free nonvolatile computer systems are designed and evaluated for energy-harvesting Internet of things (IoT) applications. In energy-harvesting applications, as power supplies generated from renewable power sources cause frequent power failures, data processed need to be backed up when power failures occur. Unless data are safely backed up before power supplies diminish, reinitialization processes are required when power supplies are recovered, which results in low energy efficiencies and slow operations. Using nonvolatile devices in processors and memories can realize a faster backup than a conventional volatile computer system, leading to a higher energy efficiency. To evaluate the energy efficiency upon frequent power failures, typical computer systems including processors and memories are designed using 90 nm CMOS or CMOS/magnetic tunnel junction (MTJ) technologies. Nonvolatile ARM Cortex-M0 processors with 4 kB MRAMs are evaluated using a typical computing benchmark program, Dhrystone, which shows a few order-of-magnitude reductions in energy in comparison with a volatile processor with SRAM.
Hardware based redundant multi-threading inside a GPU for improved reliability

DOEpatents

Sridharan, Vilas; Gurumurthi, Sudhanva

2015-05-05

A system and method for verifying computation output using computer hardware are provided. Instances of computation are generated and processed on hardware-based processors. As instances of computation are processed, each instance of computation receives a load accessible to other instances of computation. Instances of output are generated by processing the instances of computation. The instances of output are verified against each other in a hardware based processor to ensure accuracy of the output.
The Fermilab lattice supercomputer project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fischler, M.; Atac, R.; Cook, A.

1989-02-01

The ACPMAPS system is a highly cost effective, local memory MIMD computer targeted at algorithm development and production running for gauge theory on the lattice. The machine consists of a compound hypercube of crates, each of which is a full crossbar switch containing several processors. The processing nodes are single board array processors based on the Weitek XL chip set, each with a peak power of 20 MFLOPS and supported by 8 MBytes of data memory. The system currently being assembled has a peak power of 5 GFLOPS, delivering performance at approximately $250/MFLOP. The system is programmable in C andmore » Fortran. An underpinning of software routines (CANOPY) provides an easy and natural way of coding lattice problems, such that the details of parallelism, and communication and system architecture are transparent to the user. CANOPY can easily be ported to any single CPU or MIMD system which supports C, and allows the coding of typical applications with very little effort. 3 refs., 1 fig.« less
Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

NASA Technical Reports Server (NTRS)

Logan, Terry G.

1994-01-01

The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.
Towards an Autonomic Cluster Management System (ACMS) with Reflex Autonomicity

NASA Technical Reports Server (NTRS)

Truszkowski, Walt; Hinchey, Mike; Sterritt, Roy

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of providing a fault-tolerant environment and achieving significant computational capabilities for high-performance computing applications. However, the task of manually managing and configuring a cluster quickly becomes daunting as the cluster grows in size. Autonomic computing, with its vision to provide self-management, can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Autonomic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management and its evolution to include reflex reactions via pulse monitoring.
Assignment Of Finite Elements To Parallel Processors

NASA Technical Reports Server (NTRS)

Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

1990-01-01

Elements assigned approximately optimally to subdomains. Mapping algorithm based on simulated-annealing concept used to minimize approximate time required to perform finite-element computation on hypercube computer or other network of parallel data processors. Mapping algorithm needed when shape of domain complicated or otherwise not obvious what allocation of elements to subdomains minimizes cost of computation.
Multiprocessor computer overset grid method and apparatus

DOEpatents

Barnette, Daniel W.; Ober, Curtis C.

2003-01-01

A multiprocessor computer overset grid method and apparatus comprises associating points in each overset grid with processors and using mapped interpolation transformations to communicate intermediate values between processors assigned base and target points of the interpolation transformations. The method allows a multiprocessor computer to operate with effective load balance on overset grid applications.
Benchmark tests on the digital equipment corporation Alpha AXP 21164-based AlphaServer 8400, including a comparison of optimized vector and superscalar processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wasserman, H.J.

1996-02-01

The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with thatmore » of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes, is that the codes also span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architecture. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.« less
Enabling Future Robotic Missions with Multicore Processors

NASA Technical Reports Server (NTRS)

Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

2011-01-01

Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key artchitecural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.
Performance Models for Split-execution Computing Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humble, Travis S; McCaskey, Alex; Schrock, Jonathan

Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardwaremore » limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.« less
Method and apparatus of parallel computing with simultaneously operating stream prefetching and list prefetching engines

DOEpatents

Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan

2012-12-11

A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.
A high-accuracy optical linear algebra processor for finite element applications

NASA Technical Reports Server (NTRS)

Casasent, D.; Taylor, B. K.

1984-01-01

Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuray to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantities the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
Optimal processor assignment for pipeline computations

NASA Technical Reports Server (NTRS)

Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

1991-01-01

The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered.
Mobile Thread Task Manager

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

2013-01-01

The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
A novel VLSI processor architecture for supercomputing arrays

NASA Technical Reports Server (NTRS)

Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.

1993-01-01

Design of the processor element for general purpose massively parallel supercomputing arrays is highly complex and cost ineffective. To overcome this, the architecture and organization of the functional units of the processor element should be such as to suit the diverse computational structures and simplify mapping of complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different class of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP,IP,OP,CM,R,SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units are such to satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process has led to a very regular and simpler design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high performance mask programmable PAcube arrays.
Detailed description of the HP-9825A HFRMP trajectory processor (TRAJ)

NASA Technical Reports Server (NTRS)

Kindall, S. M.; Wilson, S. W.

1979-01-01

The computer code for the trajectory processor of the HP-9825A High Fidelity Relative Motion Program is described in detail. The processor is a 12-degrees-of-freedom trajectory integrator which can be used to generate digital and graphical data describing the relative motion of the Space Shuttle Orbiter and a free-flying cylindrical payload. Coding standards and flow charts are given and the computational logic is discussed.
Rational calculation accuracy in acousto-optical matrix-vector processor

NASA Astrophysics Data System (ADS)

Oparin, V. V.; Tigin, Dmitry V.

1994-01-01

The high speed of parallel computations for a comparatively small-size processor and acceptable power consumption makes the usage of acousto-optic matrix-vector multiplier (AOMVM) attractive for processing of large amounts of information in real time. The limited accuracy of computations is an essential disadvantage of such a processor. The reduced accuracy requirements allow for considerable simplification of the AOMVM architecture and the reduction of the demands on its components.
Parallel Ray Tracing Using the Message Passing Interface

DTIC Science & Technology

2007-09-01

software is available for lens design and for general optical systems modeling. It tends to be designed to run on a single processor and can be very...Cameron, Senior Member, IEEE Abstract—Ray-tracing software is available for lens design and for general optical systems modeling. It tends to be designed to...National Aeronautics and Space Administration (NASA), optical ray tracing, parallel computing, parallel pro- cessing, prime numbers, ray tracing
System on a chip with MPEG-4 capability

NASA Astrophysics Data System (ADS)

Yassa, Fathy; Schonfeld, Dan

2002-12-01

Current products supporting video communication applications rely on existing computer architectures. RISC processors have been used successfully in numerous applications over several decades. DSP processors have become ubiquitous in signal processing and communication applications. Real-time applications such as speech processing in cellular telephony rely extensively on the computational power of these processors. Video processors designed to implement the computationally intensive codec operations have also been used to address the high demands of video communication applications (e.g., cable set-top boxes and DVDs). This paper presents an overview of a system-on-chip (SOC) architecture used for real-time video in wireless communication applications. The SOC specifications answer to the system requirements imposed by the application environment. A CAM-based video processor is used to accelerate data intensive video compression tasks such as motion estimations and filtering. Other components are dedicated to system level data processing and audio processing. A rich set of I/Os allows the SOC to communicate with other system components such as baseband and memory subsystems.
JPRS Report, Science & Technology, Europe.

DTIC Science & Technology

1991-04-30

processor in collaboration with Intel . The processor , christened Touchstone, will be used as the core of a parallel computer with 2,000 processors . One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors " first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is

Satellite on-board real-time SAR processor prototype

NASA Astrophysics Data System (ADS)

Bergeron, Alain; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Marchese, Linda; Bourqui, Pascal; Desnoyers, Nicholas; Legros, Mathieu; Guillot, Ludovic; Mercier, Luc; Châteauneuf, François

2017-11-01

A Compact Real-Time Optronic SAR Processor has been successfully developed and tested up to a Technology Readiness Level of 4 (TRL4), the breadboard validation in a laboratory environment. SAR, or Synthetic Aperture Radar, is an active system allowing day and night imaging independent of the cloud coverage of the planet. The SAR raw data is a set of complex data for range and azimuth, which cannot be compressed. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates this is a clear disadvantage. SAR images are typically processed electronically applying dedicated Fourier transformations. This, however, can also be performed optically in real-time. Originally the first SAR images were optically processed. The optical Fourier processor architecture provides inherent parallel computing capabilities allowing real-time SAR data processing and thus the ability for compression and strongly reduced communication bandwidth requirements for the satellite. SAR signal return data are in general complex data. Both amplitude and phase must be combined optically in the SAR processor for each range and azimuth pixel. Amplitude and phase are generated by dedicated spatial light modulators and superimposed by an optical relay set-up. The spatial light modulators display the full complex raw data information over a two-dimensional format, one for the azimuth and one for the range. Since the entire signal history is displayed at once, the processor operates in parallel yielding real-time performances, i.e. without resulting bottleneck. Processing of both azimuth and range information is performed in a single pass. This paper focuses on the onboard capabilities of the compact optical SAR processor prototype that allows in-orbit processing of SAR images. Examples of processed ENVISAT ASAR images are presented. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed.
Manyscale Computing for Sensor Processing in Support of Space Situational Awareness

NASA Astrophysics Data System (ADS)

Schmalz, M.; Chapman, W.; Hayden, E.; Sahni, S.; Ranka, S.

2014-09-01

Increasing image and signal data burden associated with sensor data processing in support of space situational awareness implies continuing computational throughput growth beyond the petascale regime. In addition to growing applications data burden and diversity, the breadth, diversity and scalability of high performance computing architectures and their various organizations challenge the development of a single, unifying, practicable model of parallel computation. Therefore, models for scalable parallel processing have exploited architectural and structural idiosyncrasies, yielding potential misapplications when legacy programs are ported among such architectures. In response to this challenge, we have developed a concise, efficient computational paradigm and software called Manyscale Computing to facilitate efficient mapping of annotated application codes to heterogeneous parallel architectures. Our theory, algorithms, software, and experimental results support partitioning and scheduling of application codes for envisioned parallel architectures, in terms of work atoms that are mapped (for example) to threads or thread blocks on computational hardware. Because of the rigor, completeness, conciseness, and layered design of our manyscale approach, application-to-architecture mapping is feasible and scalable for architectures at petascales, exascales, and above. Further, our methodology is simple, relying primarily on a small set of primitive mapping operations and support routines that are readily implemented on modern parallel processors such as graphics processing units (GPUs) and hybrid multi-processors (HMPs). In this paper, we overview the opportunities and challenges of manyscale computing for image and signal processing in support of space situational awareness applications. We discuss applications in terms of a layered hardware architecture (laboratory > supercomputer > rack > processor > component hierarchy). Demonstration applications include performance analysis and results in terms of execution time as well as storage, power, and energy consumption for bus-connected and/or networked architectures. The feasibility of the manyscale paradigm is demonstrated by addressing four principal challenges: (1) architectural/structural diversity, parallelism, and locality, (2) masking of I/O and memory latencies, (3) scalability of design as well as implementation, and (4) efficient representation/expression of parallel applications. Examples will demonstrate how manyscale computing helps solve these challenges efficiently on real-world computing systems.
Method for simultaneous overlapped communications between neighboring processors in a multiple

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1991-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
Orthorectification by Using Gpgpu Method

NASA Astrophysics Data System (ADS)

Sahin, H.; Kulur, S.

2012-07-01

Thanks to the nature of the graphics processing, the newly released products offer highly parallel processing units with high-memory bandwidth and computational power of more than teraflops per second. The modern GPUs are not only powerful graphic engines but also they are high level parallel programmable processors with very fast computing capabilities and high-memory bandwidth speed compared to central processing units (CPU). Data-parallel computations can be shortly described as mapping data elements to parallel processing threads. The rapid development of GPUs programmability and capabilities attracted the attentions of researchers dealing with complex problems which need high level calculations. This interest has revealed the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". The graphic processors are powerful hardware which is really cheap and affordable. So the graphic processors became an alternative to computer processors. The graphic chips which were standard application hardware have been transformed into modern, powerful and programmable processors to meet the overall needs. Especially in recent years, the phenomenon of the usage of graphics processing units in general purpose computation has led the researchers and developers to this point. The biggest problem is that the graphics processing units use different programming models unlike current programming methods. Therefore, an efficient GPU programming requires re-coding of the current program algorithm by considering the limitations and the structure of the graphics hardware. Currently, multi-core processors can not be programmed by using traditional programming methods. Event procedure programming method can not be used for programming the multi-core processors. GPUs are especially effective in finding solution for repetition of the computing steps for many data elements when high accuracy is needed. Thus, it provides the computing process more quickly and accurately. Compared to the GPUs, CPUs which perform just one computing in a time according to the flow control are slower in performance. This structure can be evaluated for various applications of computer technology. In this study covers how general purpose parallel programming and computational power of the GPUs can be used in photogrammetric applications especially direct georeferencing. The direct georeferencing algorithm is coded by using GPGPU method and CUDA (Compute Unified Device Architecture) programming language. Results provided by this method were compared with the traditional CPU programming. In the other application the projective rectification is coded by using GPGPU method and CUDA programming language. Sample images of various sizes, as compared to the results of the program were evaluated. GPGPU method can be used especially in repetition of same computations on highly dense data, thus finding the solution quickly.
A Parallel Vector Machine for the PM Programming Language

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2016-04-01

PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using standard OpenMP and MPI. Performance analyses of the PM vector machine, demonstrating its scaling properties with respect to domain size and the number of processor nodes will be presented for a range of hardware configurations. The PM software and language definition are being made available under unrestrictive MIT and Creative Commons Attribution licenses respectively: www.pm-lang.org.
Analog Processor To Solve Optimization Problems

NASA Technical Reports Server (NTRS)

Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

1993-01-01

Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.
Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

NASA Astrophysics Data System (ADS)

Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

2012-11-01

In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.
Time-partitioning simulation models for calculation on parallel computers

NASA Technical Reports Server (NTRS)

Milner, Edward J.; Blech, Richard A.; Chima, Rodrick V.

1987-01-01

A technique allowing time-staggered solution of partial differential equations is presented in this report. Using this technique, called time-partitioning, simulation execution speedup is proportional to the number of processors used because all processors operate simultaneously, with each updating of the solution grid at a different time point. The technique is limited by neither the number of processors available nor by the dimension of the solution grid. Time-partitioning was used to obtain the flow pattern through a cascade of airfoils, modeled by the Euler partial differential equations. An execution speedup factor of 1.77 was achieved using a two processor Cray X-MP/24 computer.
The computational structural mechanics testbed architecture. Volume 2: The interface

NASA Technical Reports Server (NTRS)

Felippa, Carlos A.

1988-01-01

This is the third set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 3 describes the CLIP-Processor interface and related topics. It is intended only for processor developers.
A pipeline VLSI design of fast singular value decomposition processor for real-time EEG system based on on-line recursive independent component analysis.

PubMed

Huang, Kuan-Ju; Shih, Wei-Yeh; Chang, Jui Chung; Feng, Chih Wei; Fang, Wai-Chi

2013-01-01

This paper presents a pipeline VLSI design of fast singular value decomposition (SVD) processor for real-time electroencephalography (EEG) system based on on-line recursive independent component analysis (ORICA). Since SVD is used frequently in computations of the real-time EEG system, a low-latency and high-accuracy SVD processor is essential. During the EEG system process, the proposed SVD processor aims to solve the diagonal, inverse and inverse square root matrices of the target matrices in real time. Generally, SVD requires a huge amount of computation in hardware implementation. Therefore, this work proposes a novel design concept for data flow updating to assist the pipeline VLSI implementation. The SVD processor can greatly improve the feasibility of real-time EEG system applications such as brain computer interfaces (BCIs). The proposed architecture is implemented using TSMC 90 nm CMOS technology. The sample rate of EEG raw data adopts 128 Hz. The core size of the SVD processor is 580×580 um(2), and the speed of operation frequency is 20MHz. It consumes 0.774mW of power during the 8-channel EEG system per execution time.
Fault tolerance in a supercomputer through dynamic repartitioning

DOEpatents

Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

2007-02-27

A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
Fault-Tolerant, Real-Time, Multi-Core Computer System

NASA Technical Reports Server (NTRS)

Gostelow, Kim P.

2012-01-01

A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
Benchmarking NWP Kernels on Multi- and Many-core Processors

NASA Astrophysics Data System (ADS)

Michalakes, J.; Vachharajani, M.

2008-12-01

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost- performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine- grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing

NASA Astrophysics Data System (ADS)

Amooie, M. A.; Moortgat, J.

2017-12-01

We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with fast Quad Core 1.2GHz ARMv8 64bit processor, 1GB of RAM, and 32GB microSD card for local storage. Therefore, the cluster has a total RAM of 128GB that is distributed on the individual nodes and a flash capacity of 4TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between each node. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance-computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively-parallelized scalable code. We present benchmarking results for the computational performance across various number of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and a feasible learning platform for challenging engineering and scientific problems.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
Partitioning problems in parallel, pipelined and distributed computing

NASA Technical Reports Server (NTRS)

Bokhari, S.

1985-01-01

The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
A VME-based software trigger system using UNIX processors

NASA Astrophysics Data System (ADS)

Atmur, Robert; Connor, David F.; Molzon, William

1997-02-01

We have constructed a distributed computing platform with eight processors to assemble and filter data from digitization crates. The filtered data were transported to a tape-writing UNIX computer via ethernet. Each processor ran a UNIX operating system and was installed in its own VME crate. Each VME crate contained dual-port memories which interfaced with the digitizers. Using standard hardware and software (VME and UNIX) allows us to select from a wide variety of non-proprietary products and makes upgrades simpler, if they are necessary.
Noncoherent parallel optical processor for discrete two-dimensional linear transformations.

PubMed

Glaser, I

1980-10-01

We describe a parallel optical processor, based on a lenslet array, that provides general linear two-dimensional transformations using noncoherent light. Such a processor could become useful in image- and signal-processing applications in which the throughput requirements cannot be adequately satisfied by state-of-the-art digital processors. Experimental results that illustrate the feasibility of the processor by demonstrating its use in parallel optical computation of the two-dimensional Walsh-Hadamard transformation are presented.
Load Balancing Scientific Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pearce, Olga Tkachyshyn

2014-12-01

The largest supercomputers have millions of independent processors, and concurrency levels are rapidly increasing. For ideal efficiency, developers of the simulations that run on these machines must ensure that computational work is evenly balanced among processors. Assigning work evenly is challenging because many large modern parallel codes simulate behavior of physical systems that evolve over time, and their workloads change over time. Furthermore, the cost of imbalanced load increases with scale because most large-scale scientific simulations today use a Single Program Multiple Data (SPMD) parallel programming model, and an increasing number of processors will wait for the slowest one atmore » the synchronization points. To address load imbalance, many large-scale parallel applications use dynamic load balance algorithms to redistribute work evenly. The research objective of this dissertation is to develop methods to decide when and how to load balance the application, and to balance it effectively and affordably. We measure and evaluate the computational load of the application, and develop strategies to decide when and how to correct the imbalance. Depending on the simulation, a fast, local load balance algorithm may be suitable, or a more sophisticated and expensive algorithm may be required. We developed a model for comparison of load balance algorithms for a specific state of the simulation that enables the selection of a balancing algorithm that will minimize overall runtime.« less

PS3 CELL Development for Scientific Computation and Research

NASA Astrophysics Data System (ADS)

Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.

2007-12-01

The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) by using vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality and are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fact entwines well with the needs of the scientific community where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs reducing required run time by one sixth. Further vectorization of the code can allow for 4 simultanious floating point operations by using the SIMD (single instruction multiple data) capabilities of the SPU increasing efficiency 24 times.
Multicore Programming Challenges

NASA Astrophysics Data System (ADS)

Perrone, Michael

The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.
Polymorphous computing fabric

DOEpatents

Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM

2011-01-18

Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.
Support for Diagnosis of Custom Computer Hardware

NASA Technical Reports Server (NTRS)

Molock, Dwaine S.

2008-01-01

The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response

NASA Technical Reports Server (NTRS)

Abdi, Frank

1996-01-01

A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
Rectangular Array Of Digital Processors For Planning Paths

NASA Technical Reports Server (NTRS)

Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

1993-01-01

Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. Transmits signals to nearest neighbor processors, which retransmits to other neighboring processors, and process repeats until signals propagated across entire field.
Symplectic molecular dynamics simulations on specially designed parallel computers.

PubMed

Borstnik, Urban; Janezic, Dusanka

2005-01-01

We have developed a computer program for molecular dynamics (MD) simulation that implements the Split Integration Symplectic Method (SISM) and is designed to run on specialized parallel computers. The MD integration is performed by the SISM, which analytically treats high-frequency vibrational motion and thus enables the use of longer simulation time steps. The low-frequency motion is treated numerically on specially designed parallel computers, which decreases the computational time of each simulation time step. The combination of these approaches means that less time is required and fewer steps are needed and so enables fast MD simulations. We study the computational performance of MD simulation of molecular systems on specialized computers and provide a comparison to standard personal computers. The combination of the SISM with two specialized parallel computers is an effective way to increase the speed of MD simulations up to 16-fold over a single PC processor.
Multitask neurovision processor with extensive feedback and feedforward connections

NASA Astrophysics Data System (ADS)

Gupta, Madan M.; Knopf, George K.

1991-11-01

A multi-task neuro-vision parameter which performs a variety of information processing operations associated with the early stages of biological vision is presented. The network architecture of this neuro-vision processor, called the positive-negative (PN) neural processor, is loosely based on the neural activity fields exhibited by thalamic and cortical nervous tissue layers. The computational operation performed by the processor arises from the strength of the recurrent feedback among the numerous positive and negative neural computing units. By adjusting the feedback connections it is possible to generate diverse dynamic behavior that may be used for short-term visual memory (STVM), spatio-temporal filtering (STF), and pulse frequency modulation (PFM). The information attributes that are to be processes may be regulated by modifying the feedforward connections from the signal space to the neural processor.
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

NASA Astrophysics Data System (ADS)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.; Papka, M. E.; Benjamin, D. P.

2017-01-01

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application and the performance that was achieved.
Implementing Access to Data Distributed on Many Processors

NASA Technical Reports Server (NTRS)

James, Mark

2006-01-01

A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization.
Computational Particle Dynamic Simulations on Multicore Processors (CPDMu) Final Report Phase I

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schmalz, Mark S

2011-07-24

Statement of Problem - Department of Energy has many legacy codes for simulation of computational particle dynamics and computational fluid dynamics applications that are designed to run on sequential processors and are not easily parallelized. Emerging high-performance computing architectures employ massively parallel multicore architectures (e.g., graphics processing units) to increase throughput. Parallelization of legacy simulation codes is a high priority, to achieve compatibility, efficiency, accuracy, and extensibility. General Statement of Solution - A legacy simulation application designed for implementation on mainly-sequential processors has been represented as a graph G. Mathematical transformations, applied to G, produce a graph representation {und G}more » for a high-performance architecture. Key computational and data movement kernels of the application were analyzed/optimized for parallel execution using the mapping G {yields} {und G}, which can be performed semi-automatically. This approach is widely applicable to many types of high-performance computing systems, such as graphics processing units or clusters comprised of nodes that contain one or more such units. Phase I Accomplishments - Phase I research decomposed/profiled computational particle dynamics simulation code for rocket fuel combustion into low and high computational cost regions (respectively, mainly sequential and mainly parallel kernels), with analysis of space and time complexity. Using the research team's expertise in algorithm-to-architecture mappings, the high-cost kernels were transformed, parallelized, and implemented on Nvidia Fermi GPUs. Measured speedups (GPU with respect to single-core CPU) were approximately 20-32X for realistic model parameters, without final optimization. Error analysis showed no loss of computational accuracy. Commercial Applications and Other Benefits - The proposed research will constitute a breakthrough in solution of problems related to efficient parallel computation of particle and fluid dynamics simulations. These problems occur throughout DOE, military and commercial sectors: the potential payoff is high. We plan to license or sell the solution to contractors for military and domestic applications such as disaster simulation (aerodynamic and hydrodynamic), Government agencies (hydrological and environmental simulations), and medical applications (e.g., in tomographic image reconstruction). Keywords - High-performance Computing, Graphic Processing Unit, Fluid/Particle Simulation. Summary for Members of Congress - Department of Energy has many simulation codes that must compute faster, to be effective. The Phase I research parallelized particle/fluid simulations for rocket combustion, for high-performance computing systems.« less
Parallel computing on Unix workstation arrays

NASA Astrophysics Data System (ADS)

Reale, F.; Bocchino, F.; Sciortino, S.

1994-12-01

We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by exucuting it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.
Computer program documentation for the pasture/range condition assessment processor

NASA Technical Reports Server (NTRS)

Mcintyre, K. S.; Miller, T. G. (Principal Investigator)

1982-01-01

The processor which drives for the RANGE software allows the user to analyze LANDSAT data containing pasture and rangeland. Analysis includes mapping, generating statistics, calculating vegetative indexes, and plotting vegetative indexes. Routines for using the processor are given. A flow diagram is included.
Potential of minicomputer/array-processor system for nonlinear finite-element analysis

NASA Technical Reports Server (NTRS)

Strohkorb, G. A.; Noor, A. K.

1983-01-01

The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.
Next Generation Space Telescope Integrated Science Module Data System

NASA Technical Reports Server (NTRS)

Schnurr, Richard G.; Greenhouse, Matthew A.; Jurotich, Matthew M.; Whitley, Raymond; Kalinowski, Keith J.; Love, Bruce W.; Travis, Jeffrey W.; Long, Knox S.

1999-01-01

The Data system for the Next Generation Space Telescope (NGST) Integrated Science Module (ISIM) is the primary data interface between the spacecraft, telescope, and science instrument systems. This poster includes block diagrams of the ISIM data system and its components derived during the pre-phase A Yardstick feasibility study. The poster details the hardware and software components used to acquire and process science data for the Yardstick instrument compliment, and depicts the baseline external interfaces to science instruments and other systems. This baseline data system is a fully redundant, high performance computing system. Each redundant computer contains three 150 MHz power PC processors. All processors execute a commercially available real time multi-tasking operating system supporting, preemptive multi-tasking, file management and network interfaces. These six processors in the system are networked together. The spacecraft interface baseline is an extension of the network, which links the six processors. The final selection for Processor busses, processor chips, network interfaces, and high-speed data interfaces will be made during mid 2002.
Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

NASA Astrophysics Data System (ADS)

Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

2008-04-01

This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Sohn, Andrew

1996-01-01

Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
The MasPar MP-1 As a Computer Arithmetic Laboratory

PubMed Central

Anuta, Michael A.; Lozier, Daniel W.; Turner, Peter R.

1996-01-01

This paper is a blueprint for the use of a massively parallel SIMD computer architecture for the simulation of various forms of computer arithmetic. The particular system used is a DEC/MasPar MP-1 with 4096 processors in a square array. This architecture has many advantages for such simulations due largely to the simplicity of the individual processors. Arithmetic operations can be spread across the processor array to simulate a hardware chip. Alternatively they may be performed on individual processors to allow simulation of a massively parallel implementation of the arithmetic. Compromises between these extremes permit speed-area tradeoffs to be examined. The paper includes a description of the architecture and its features. It then summarizes some of the arithmetic systems which have been, or are to be, implemented. The implementation of the level-index and symmetric level-index, LI and SLI, systems is described in some detail. An extensive bibliography is included. PMID:27805123
Discovering Motifs in Biological Sequences Using the Micron Automata Processor.

PubMed

Roy, Indranil; Aluru, Srinivas

2016-01-01

Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.
Fault-Tolerant Software-Defined Radio on Manycore

NASA Technical Reports Server (NTRS)

Ricketts, Scott

2015-01-01

Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiationhardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload-reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.

Design and evaluation of an architecture for a digital signal processor for instrumentation applications

NASA Astrophysics Data System (ADS)

Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos

1990-03-01

The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
Transputer parallel processing at NASA Lewis Research Center

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1989-01-01

The transputer parallel processing lab at NASA Lewis Research Center (LeRC) consists of 69 processors (transputers) that can be connected into various networks for use in general purpose concurrent processing applications. The main goal of the lab is to develop concurrent scientific and engineering application programs that will take advantage of the computational speed increases available on a parallel processor over the traditional sequential processor. Current research involves the development of basic programming tools. These tools will help standardize program interfaces to specific hardware by providing a set of common libraries for applications programmers. The thrust of the current effort is in developing a set of tools for graphics rendering/animation. The applications programmer currently has two options for on-screen plotting. One option can be used for static graphics displays and the other can be used for animated motion. The option for static display involves the use of 2-D graphics primitives that can be called from within an application program. These routines perform the standard 2-D geometric graphics operations in real-coordinate space as well as allowing multiple windows on a single screen.
CoNNeCT Baseband Processor Module

NASA Technical Reports Server (NTRS)

Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.

2011-01-01

A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25- Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.
A Electro-Optical Image Algebra Processing System for Automatic Target Recognition

NASA Astrophysics Data System (ADS)

Coffield, Patrick Cyrus

The proposed electro-optical image algebra processing system is designed specifically for image processing and other related computations. The design is a hybridization of an optical correlator and a massively paralleled, single instruction multiple data processor. The architecture of the design consists of three tightly coupled components: a spatial configuration processor (the optical analog portion), a weighting processor (digital), and an accumulation processor (digital). The systolic flow of data and image processing operations are directed by a control buffer and pipelined to each of the three processing components. The image processing operations are defined in terms of basic operations of an image algebra developed by the University of Florida. The algebra is capable of describing all common image-to-image transformations. The merit of this architectural design is how it implements the natural decomposition of algebraic functions into spatially distributed, point use operations. The effect of this particular decomposition allows convolution type operations to be computed strictly as a function of the number of elements in the template (mask, filter, etc.) instead of the number of picture elements in the image. Thus, a substantial increase in throughput is realized. The implementation of the proposed design may be accomplished in many ways. While a hybrid electro-optical implementation is of primary interest, the benefits and design issues of an all digital implementation are also discussed. The potential utility of this architectural design lies in its ability to control a large variety of the arithmetic and logic operations of the image algebra's generalized matrix product. The generalized matrix product is the most powerful fundamental operation in the algebra, thus allowing a wide range of applications. No other known device or design has made this claim of processing speed and general implementation of a heterogeneous image algebra.
Bringing Algorithms to Life: Cooperative Computing Activities Using Students as Processors.

ERIC Educational Resources Information Center

Bachelis, Gregory F.; And Others

1994-01-01

Presents cooperative computing activities in which each student plays the role of a switch or processor and acts out algorithms. Includes binary counting, finding the smallest card in a deck, sorting by selection and merging, adding and multiplying large numbers, and sieving for primes. (16 references) (Author/MKR)
Advanced flight computers for planetary exploration

NASA Technical Reports Server (NTRS)

Stephenson, R. Rhoads

1988-01-01

Research concerning flight computers for use on interplanetary probes is reviewed. The history of these computers from the Viking mission to the present is outlined. The differences between ground commercial computers and computers for planetary exploration are listed. The development of a computer for the Mariner Mark II comet rendezvous asteroid flyby mission is described. Various aspects of recently developed computer systems are examined, including the Max real time, embedded computer, a hypercube distributed supercomputer, a SAR data processor, a processor for the High Resolution IR Imaging Spectrometer, and a robotic vision multiresolution pyramid machine for processsing images obtained by a Mars Rover.
Visualization of unsteady computational fluid dynamics

NASA Astrophysics Data System (ADS)

Haimes, Robert

1994-11-01

A brief summary of the computer environment used for calculating three dimensional unsteady Computational Fluid Dynamic (CFD) results is presented. This environment requires a super computer as well as massively parallel processors (MPP's) and clusters of workstations acting as a single MPP (by concurrently working on the same task) provide the required computational bandwidth for CFD calculations of transient problems. The cluster of reduced instruction set computers (RISC) is a recent advent based on the low cost and high performance that workstation vendors provide. The cluster, with the proper software can act as a multiple instruction/multiple data (MIMD) machine. A new set of software tools is being designed specifically to address visualizing 3D unsteady CFD results in these environments. Three user's manuals for the parallel version of Visual3, pV3, revision 1.00 make up the bulk of this report.
Visualization of unsteady computational fluid dynamics

NASA Technical Reports Server (NTRS)

Haimes, Robert

1994-01-01

A brief summary of the computer environment used for calculating three dimensional unsteady Computational Fluid Dynamic (CFD) results is presented. This environment requires a super computer as well as massively parallel processors (MPP's) and clusters of workstations acting as a single MPP (by concurrently working on the same task) provide the required computational bandwidth for CFD calculations of transient problems. The cluster of reduced instruction set computers (RISC) is a recent advent based on the low cost and high performance that workstation vendors provide. The cluster, with the proper software can act as a multiple instruction/multiple data (MIMD) machine. A new set of software tools is being designed specifically to address visualizing 3D unsteady CFD results in these environments. Three user's manuals for the parallel version of Visual3, pV3, revision 1.00 make up the bulk of this report.
Demonstration of two-qubit algorithms with a superconducting quantum processor.

PubMed

DiCarlo, L; Chow, J M; Gambetta, J M; Bishop, Lev S; Johnson, B R; Schuster, D I; Majer, J; Blais, A; Frunzio, L; Girvin, S M; Schoelkopf, R J

2009-07-09

Quantum computers, which harness the superposition and entanglement of physical states, could outperform their classical counterparts in solving problems with technological impact-such as factoring large numbers and searching databases. A quantum processor executes algorithms by applying a programmable sequence of gates to an initialized register of qubits, which coherently evolves into a final state containing the result of the computation. Building a quantum processor is challenging because of the need to meet simultaneously requirements that are in conflict: state preparation, long coherence times, universal gate operations and qubit readout. Processors based on a few qubits have been demonstrated using nuclear magnetic resonance, cold ion trap and optical systems, but a solid-state realization has remained an outstanding challenge. Here we demonstrate a two-qubit superconducting processor and the implementation of the Grover search and Deutsch-Jozsa quantum algorithms. We use a two-qubit interaction, tunable in strength by two orders of magnitude on nanosecond timescales, which is mediated by a cavity bus in a circuit quantum electrodynamics architecture. This interaction allows the generation of highly entangled states with concurrence up to 94 per cent. Although this processor constitutes an important step in quantum computing with integrated circuits, continuing efforts to increase qubit coherence times, gate performance and register size will be required to fulfil the promise of a scalable technology.
Fast 2D FWI on a multi and many-cores workstation.

NASA Astrophysics Data System (ADS)

Thierry, Philippe; Donno, Daniela; Noble, Mark

2014-05-01

Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12 cores E5-v2 x86-64 CPU, we present here a MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors making the workstation as a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while improving computation time. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e running only on the co-processor) thanks to the Linux ssh and NFS capabilities. Usual care of optimization and SIMD vectorization is used to ensure optimal performances, and to analyze the application performances and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (with L-BFGS algorithm) optimization scheme for the model parameters update. Parallelization is achieved through standard MPI shot gathers distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the i/o subsystem and PCIe bus. In this presentation we will also review some simple methodologies to determine performance expectation compared to real performances in order to get optimization effort estimation before starting any huge modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.
Buffered coscheduling for parallel programming and enhanced fault tolerance

DOEpatents

Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

2006-01-31

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors
The CSM testbed matrix processors internal logic and dataflow descriptions

NASA Technical Reports Server (NTRS)

Regelbrugge, Marc E.; Wright, Mary A.

1988-01-01

This report constitutes the final report for subtask 1 of Task 5 of NASA Contract NAS1-18444, Computational Structural Mechanics (CSM) Research. This report contains a detailed description of the coded workings of selected CSM Testbed matrix processors (i.e., TOPO, K, INV, SSOL) and of the arithmetic utility processor AUS. These processors and the current sparse matrix data structures are studied and documented. Items examined include: details of the data structures, interdependence of data structures, data-blocking logic in the data structures, processor data flow and architecture, and processor algorithmic logic flow.
Addressing the challenges of standalone multi-core simulations in molecular dynamics

NASA Astrophysics Data System (ADS)

Ocaya, R. O.; Terblans, J. J.

2017-07-01

Computational modelling in material science involves mathematical abstractions of force fields between particles with the aim to postulate, develop and understand materials by simulation. The aggregated pairwise interactions of the material's particles lead to a deduction of its macroscopic behaviours. For practically meaningful macroscopic scales, a large amount of data are generated, leading to vast execution times. Simulation times of hours, days or weeks for moderately sized problems are not uncommon. The reduction of simulation times, improved result accuracy and the associated software and hardware engineering challenges are the main motivations for many of the ongoing researches in the computational sciences. This contribution is concerned mainly with simulations that can be done on a "standalone" computer based on Message Passing Interfaces (MPI), parallel code running on hardware platforms with wide specifications, such as single/multi- processor, multi-core machines with minimal reconfiguration for upward scaling of computational power. The widely available, documented and standardized MPI library provides this functionality through the MPI_Comm_size (), MPI_Comm_rank () and MPI_Reduce () functions. A survey of the literature shows that relatively little is written with respect to the efficient extraction of the inherent computational power in a cluster. In this work, we discuss the main avenues available to tap into this extra power without compromising computational accuracy. We also present methods to overcome the high inertia encountered in single-node-based computational molecular dynamics. We begin by surveying the current state of the art and discuss what it takes to achieve parallelism, efficiency and enhanced computational accuracy through program threads and message passing interfaces. Several code illustrations are given. The pros and cons of writing raw code as opposed to using heuristic, third-party code are also discussed. The growing trend towards graphical processor units and virtual computing clouds for high-performance computing is also discussed. Finally, we present the comparative results of vacancy formation energy calculations using our own parallelized standalone code called Verlet-Stormer velocity (VSV) operating on 30,000 copper atoms. The code is based on the Sutton-Chen implementation of the Finnis-Sinclair pairwise embedded atom potential. A link to the code is also given.
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G; Salapura, Valentina

2014-12-02

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
DMA shared byte counters in a parallel computer

DOEpatents

Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

2010-04-06

A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Optimistic barrier synchronization

NASA Technical Reports Server (NTRS)

Nicol, David M.

1992-01-01

Barrier synchronization is fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all the work required of it prior to synchronization. The alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all the necessary pre-synchronization computation, is treated. The problem arises when the number of pre-sychronization messages to be received by a processor is unkown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log sup 2 P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions as well as associative parallel prefix computations.
Unaligned instruction relocation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.

In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unalignedmore » ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.« less
Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures

NASA Technical Reports Server (NTRS)

Ma, Kwan-Liu

1995-01-01

As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
Unaligned instruction relocation

DOEpatents

Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.; Sura, Zehra N.

2018-01-23

In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.
Extending the granularity of representation and control for the MIL-STD CAIS 1.0 node model

NASA Technical Reports Server (NTRS)

Rogers, Kathy L.

1986-01-01

The Common APSE (Ada 1 Program Support Environment) Interface Set (CAIS) (DoD85) node model provides an excellent baseline for interfaces in a single-host development environment. To encompass the entire spectrum of computing, however, the CAIS model should be extended in four areas. It should provide the interface between the engineering workstation and the host system throughout the entire lifecycle of the system. It should provide a basis for communication and integration functions needed by distributed host environments. It should provide common interfaces for communications mechanisms to and among target processors. It should provide facilities for integration, validation, and verification of test beds extending to distributed systems on geographically separate processors with heterogeneous instruction set architectures (ISAS). Additions to the PROCESS NODE model to extend the CAIS into these four areas are proposed.

SAPIENS: Spreading Activation Processor for Information Encoded in Network Structures. Technical Report No. 296.

ERIC Educational Resources Information Center

Ortony, Andrew; Radin, Dean I.

The product of researchers' efforts to develop a computer processor which distinguishes between relevant and irrelevant information in the database, Spreading Activation Processor for Information Encoded in Network Structures (SAPIENS) exhibits (1) context sensitivity, (2) efficiency, (3) decreasing activation over time, (4) summation of…
Global synchronization of parallel processors using clock pulse width modulation

DOEpatents

Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

2013-04-02

A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and performs a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.
The mathematical theory of signal processing and compression-designs

NASA Astrophysics Data System (ADS)

Feria, Erlan H.

2006-05-01

The mathematical theory of signal processing, named processor coding, will be shown to inherently arise as the computational time dual of Shannon's mathematical theory of communication which is also known as source coding. Source coding is concerned with signal source memory space compression while processor coding deals with signal processor computational time compression. Their combination is named compression-designs and referred as Conde in short. A compelling and pedagogically appealing diagram will be discussed highlighting Conde's remarkable successful application to real-world knowledge-aided (KA) airborne moving target indicator (AMTI) radar.
HP-9825A HFRMP trajectory processor (#TRAJ), detailed description. [relative motion of the space shuttle orbiter and a free-flying payload

NASA Technical Reports Server (NTRS)

Kindall, S. M.

1980-01-01

The computer code for the trajectory processor (#TRAJ) of the high fidelity relative motion program is described. The #TRAJ processor is a 12-degrees-of-freedom trajectory integrator (6 degrees of freedom for each of two vehicles) which can be used to generate digital and graphical data describing the relative motion of the Space Shuttle Orbiter and a free-flying cylindrical payload. A listing of the code, coding standards and conventions, detailed flow charts, and discussions of the computational logic are included.
The Use of a Microcomputer Based Array Processor for Real Time Laser Velocimeter Data Processing

NASA Technical Reports Server (NTRS)

Meyers, James F.

1990-01-01

The application of an array processor to laser velocimeter data processing is presented. The hardware is described along with the method of parallel programming required by the array processor. A portion of the data processing program is described in detail. The increase in computational speed of a microcomputer equipped with an array processor is illustrated by comparative testing with a minicomputer.
Design of Low-Cost Impact Reporting System

DTIC Science & Technology

2015-12-01

Single Board Computers (SBC) available. Arduino and Raspberry Pi are very low cost and have huge communities for hardware design. Most of the SBC... Raspberry Pi Model B has a considerably faster processor than the Arduino. Although it provides only approximately 25 General Purpose Input and Output...reporting system must be able to operate on its own power for more than 2 or 3 hours. The Raspberry Pi Model B operates on 5 volts direct current at
DSS 13 Microprocessor Antenna Controller

NASA Technical Reports Server (NTRS)

Gosline, R. M.

1984-01-01

A microprocessor based antenna controller system developed as part of the unattended station project for DSS 13 is described. Both the hardware and software top level designs are presented and the major problems encounted are discussed. Developments useful to related projects include a JPL standard 15 line interface using a single board computer, a general purpose parser, a fast floating point to ASCII conversion technique, and experience gained in using off board floating point processors with the 8080 CPU.
Execution of a parallel edge-based Navier-Stokes solver on commodity graphics processor units

NASA Astrophysics Data System (ADS)

Corral, Roque; Gisbert, Fernando; Pueblas, Jesus

2017-02-01

The implementation of an edge-based three-dimensional Reynolds Average Navier-Stokes solver for unstructured grids able to run on multiple graphics processing units (GPUs) is presented. Loops over edges, which are the most time-consuming part of the solver, have been written to exploit the massively parallel capabilities of GPUs. Non-blocking communications between parallel processes and between the GPU and the central processor unit (CPU) have been used to enhance code scalability. The code is written using a mixture of C++ and OpenCL, to allow the execution of the source code on GPUs. The Message Passage Interface (MPI) library is used to allow the parallel execution of the solver on multiple GPUs. A comparative study of the solver parallel performance is carried out using a cluster of CPUs and another of GPUs. It is shown that a single GPU is up to 64 times faster than a single CPU core. The parallel scalability of the solver is mainly degraded due to the loss of computing efficiency of the GPU when the size of the case decreases. However, for large enough grid sizes, the scalability is strongly improved. A cluster featuring commodity GPUs and a high bandwidth network is ten times less costly and consumes 33% less energy than a CPU-based cluster with an equivalent computational power.
Optimization strategies for molecular dynamics programs on Cray computers and scalar work stations

NASA Astrophysics Data System (ADS)

Unekis, Michael J.; Rice, Betsy M.

1994-12-01

We present results of timing runs and different optimization strategies for a prototype molecular dynamics program that simulates shock waves in a two-dimensional (2-D) model of a reactive energetic solid. The performance of the program may be improved substantially by simple changes to the Fortran or by employing various vendor-supplied compiler optimizations. The optimum strategy varies among the machines used and will vary depending upon the details of the program. The effect of various compiler options and vendor-supplied subroutine calls is demonstrated. Comparison is made between two scalar workstations (IBM RS/6000 Model 370 and Model 530) and several Cray supercomputers (X-MP/48, Y-MP8/128, and C-90/16256). We find that for a scientific application program dominated by sequential, scalar statements, a relatively inexpensive high-end work station such as the IBM RS/60006 RISC series will outperform single processor performance of the Cray X-MP/48 and perform competitively with single processor performance of the Y-MP8/128 and C-9O/16256.
2nd Generation QUATARA Flight Computer Project

NASA Technical Reports Server (NTRS)

Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven

2015-01-01

Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant by creating both a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will leverage on the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
Programmable DNA-Mediated Multitasking Processor.

PubMed

Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

2015-04-30

Because of DNA appealing features as perfect material, including minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNAs as information storing and processing substrates to tackle the computational problems. The massive parallelism of DNA hybridization exhibits transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over the conventional electronic processors with stepwise signal cascade. As an example of multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over the existing silicon-mediated methods, such as conducting massive data storage and simultaneous processing via much fewer materials than conventional silicon devices.
Accuracy requirements of optical linear algebra processors in adaptive optics imaging systems

NASA Technical Reports Server (NTRS)

Downie, John D.

1990-01-01

A ground-based adaptive optics imaging telescope system attempts to improve image quality by detecting and correcting for atmospherically induced wavefront aberrations. The required control computations during each cycle will take a finite amount of time. Longer time delays result in larger values of residual wavefront error variance since the atmosphere continues to change during that time. Thus an optical processor may be well-suited for this task. This paper presents a study of the accuracy requirements in a general optical processor that will make it competitive with, or superior to, a conventional digital computer for the adaptive optics application. An optimization of the adaptive optics correction algorithm with respect to an optical processor's degree of accuracy is also briefly discussed.
Mindmodeling@Home. . . and Anywhere Else You Have Idle Processors

DTIC Science & Technology

2009-07-01

the continuous growth rate of end-user processing capability around the world. The first volunteer computing project was SETI @Home. It was... SETI @Home remains the longest running and one of the most popular volunteer computing projects in the world. This actually is an impressive feat...volunteer computing projects available to those interested in donating their idle processor time to scientific pursuits. Most of them, including SETI
An FPGA computing demo core for space charge simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wu, Jinyuan; Huang, Yifei; /Fermilab

2009-01-01

In accelerator physics, space charge simulation requires large amount of computing power. In a particle system, each calculation requires time/resource consuming operations such as multiplications, divisions, and square roots. Because of the flexibility of field programmable gate arrays (FPGAs), we implemented this task with efficient use of the available computing resources and completely eliminated non-calculating operations that are indispensable in regular micro-processors (e.g. instruction fetch, instruction decoding, etc.). We designed and tested a 16-bit demo core for computing Coulomb's force in an Altera Cyclone II FPGA device. To save resources, the inverse square-root cube operation in our design is computedmore » using a memory look-up table addressed with nine to ten most significant non-zero bits. At 200 MHz internal clock, our demo core reaches a throughput of 200 M pairs/s/core, faster than a typical 2 GHz micro-processor by about a factor of 10. Temperature and power consumption of FPGAs were also lower than those of micro-processors. Fast and convenient, FPGAs can serve as alternatives to time-consuming micro-processors for space charge simulation.« less
Case for a field-programmable gate array multicore hybrid machine for an image-processing application

NASA Astrophysics Data System (ADS)

Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

2011-01-01

General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Conditional Dispersive Readout of a CMOS Single-Electron Memory Cell

NASA Astrophysics Data System (ADS)

Schaal, S.; Barraud, S.; Morton, J. J. L.; Gonzalez-Zalba, M. F.

2018-05-01

Quantum computers require interfaces with classical electronics for efficient qubit control, measurement, and fast data processing. Fabricating the qubit and the classical control layer using the same technology is appealing because it will facilitate the integration process, improving feedback speeds and offering potential solutions to wiring and layout challenges. Integrating classical and quantum devices monolithically, using complementary metal-oxide-semiconductor (CMOS) processes, enables the processor to profit from the most mature industrial technology for the fabrication of large-scale circuits. We demonstrate a CMOS single-electron memory cell composed of a single quantum dot and a transistor that locks charge on the quantum-dot gate. The single-electron memory cell is conditionally read out by gate-based dispersive sensing using a lumped-element L C resonator. The control field-effect transistor (FET) and quantum dot are fabricated on the same chip using fully depleted silicon-on-insulator technology. We obtain a charge sensitivity of δ q =95 ×10-6e Hz-1 /2 when the quantum-dot readout is enabled by the control FET, comparable to results without the control FET. Additionally, we observe a single-electron retention time on the order of a second when storing a single-electron charge on the quantum dot at millikelvin temperatures. These results demonstrate first steps towards time-based multiplexing of gate-based dispersive readout in CMOS quantum devices opening the path for the development of an all-silicon quantum-classical processor.
Dynamically allocating sets of fine-grained processors to running computations

NASA Technical Reports Server (NTRS)

Middleton, David

1988-01-01

Researchers explore an approach to using general purpose parallel computers which involves mapping hardware resources onto computations instead of mapping computations onto hardware. Problems such as processor allocation, task scheduling and load balancing, which have traditionally proven to be challenging, change significantly under this approach and may become amenable to new attacks. Researchers describe the implementation of this approach used by the FFP Machine whose computation and communication resources are repeatedly partitioned into disjoint groups that match the needs of available tasks from moment to moment. Several consequences of this system are examined.
Final report for the Tera Computer TTI CRADA

DOE Office of Scientific and Technical Information (OSTI.GOV)

Davidson, G.S.; Pavlakos, C.; Silva, C.

1997-01-01

Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTAmore » parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.« less
TCP/IP Interface for the Satellite Orbit Analysis Program (SOAP)

NASA Technical Reports Server (NTRS)

Carnright, Robert; Stodden, David; Coggi, John

2009-01-01

The Transmission Control Protocol/ Internet protocol (TCP/IP) interface for the Satellite Orbit Analysis Program (SOAP) provides the means for the software to establish real-time interfaces with other software. Such interfaces can operate between two programs, either on the same computer or on different computers joined by a network. The SOAP TCP/IP module employs a client/server interface where SOAP is the server and other applications can be clients. Real-time interfaces between software offer a number of advantages over embedding all of the common functionality within a single program. One advantage is that they allow each program to divide the computation labor between processors or computers running the separate applications. Secondly, each program can be allowed to provide its own expertise domain with other programs able to use this expertise.
A Medical Language Processor for Two Indo-European Languages

PubMed Central

Nhan, Ngo Thanh; Sager, Naomi; Lyman, Margaret; Tick, Leo J.; Borst, François; Su, Yun

1989-01-01

The syntax and semantics of clinical narrative across Indo-European languages are quite similar, making it possible to envison a single medical language processor that can be adapted for different European languages. The Linguistic String Project of New York University is continuing the development of its Medical Language Processor in this direction. The paper describes how the processor operates on English and French.

FPGA Acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.

PubMed

Zierke, Stephanie; Bakos, Jason D

2010-04-12

Likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such it contains a high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA's on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors. We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10x speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation that is chosen specifically to leverage a deeply pipelined custom architecture. Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption as compared to many-core processors and Graphics Processor Units (GPUs).
Finite elements and the method of conjugate gradients on a concurrent processor

NASA Technical Reports Server (NTRS)

Lyzenga, G. A.; Raefsky, A.; Hager, G. H.

1985-01-01

An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90 percent for sufficiently large problems.
Finite elements and the method of conjugate gradients on a concurrent processor

NASA Technical Reports Server (NTRS)

Lyzenga, G. A.; Raefsky, A.; Hager, B. H.

1984-01-01

An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90% for sufficiently large problems.
Distributed computing methodology for training neural networks in an image-guided diagnostic application.

PubMed

Plagianakos, V P; Magoulas, G D; Vrahatis, M N

2006-03-01

Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values, and, thus, training neural networks utilizing various learning methods. The proposed methodology has large granularity and low synchronization, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
Non-radiation hardened microprocessors in space-based remote sensing systems

NASA Astrophysics Data System (ADS)

DeCoursey, R.; Melton, Ryan; Estes, Robert R., Jr.

2006-09-01

The CALIPSO (Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations) mission is a comprehensive suite of active and passive sensors including a 20Hz 230mj Nd:YAG lidar, a visible wavelength Earth-looking camera and an imaging infrared radiometer. CALIPSO flies in formation with the Earth Observing System Post-Meridian (EOS PM) train, provides continuous, near-simultaneous measurements and is a planned 3 year mission. CALIPSO was launched into a 98 degree sun synchronous Earth orbit in April of 2006 to study clouds and aerosols and acquires over 5 gigabytes of data every 24 hours. Figure 1 shows the ground track of one CALIPSO orbit as well as high and low intensity South Atlantic Anomaly outlines. CALIPSO passes through the SAA several times each day. Spaced based remote sensing systems that include multiple instruments and/or instruments such as lidar generate large volumes of data and require robust real-time hardware and software mechanisms and high throughput processors. Due to onboard storage restrictions and telemetry downlink limitations these systems must pre-process and reduce the data before sending it to the ground. This onboard processing and realtime requirement load may mean that newer more powerful processors are needed even though acceptable radiation-hardened versions have not yet been released. CALIPSO's single board computer payload controller processor is actually a set of four (4) voting non-radiation hardened COTS Power PC 603r's built on a single width VME card by General Dynamics Advanced Information Systems (GDAIS). Significant radiation concerns for CALIPSO and other Low Earth Orbit (LEO) satellites include the South Atlantic Anomaly (SAA), the north and south poles and strong solar events. Over much of South America and extending into the South Atlantic Ocean (see figure 1) the Van Allen radiation belts dip to just 200-800km and spacecraft entering this area are subjected to high energy protons and experience higher than normal Single Event Upset (SEU) and Single Event Latch-up (SEL) rates. Although less significant, spacecraft flying in the area around the poles experience similar upsets. Finally, powerful solar proton events in the range of 10MeV/10pfu to 100MeV/1pfu as are forecasted and tracked by NOAA's Space Environment Center in Colorado can result in SingleEvent Upset (SEU), Single Event Latch-up (SEL) and permanent failures such as Single Event Gate Rupture (SEGR) in some technologies. (Galactic Cosmic Rays (GCRs) are another source, especially for gate rupture) CALIPSO mitigates common radiation concerns in its data handling through the use of redundant processors, radiation-hardened Application Specific Integrated Circuits (ASIC), hardware-based Error Detection and Correction (EDAC), processor and memory scrubbing, redundant boot code and mirrored files. After presenting a system overview this paper will expand on each of these strategies. Where applicable, related on-orbit data collected since the CALIPSO initial boot on May 4, 2006 will be noted.
Non Radiation Hardened Microprocessors in Spaced Based Remote Sensing Systems

NASA Technical Reports Server (NTRS)

Decoursey, Robert J.; Estes, Robert F.; Melton, Ryan

2006-01-01

The CALIPSO (Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observations) mission is a comprehensive suite of active and passive sensors including a 20Hz 230mj Nd:YAG lidar, a visible wavelength Earth-looking camera and an imaging infrared radiometer. CALIPSO flies in formation with the Earth Observing System Post-Meridian (EOS PM) train, provides continuous, near-simultaneous measurements and is a planned 3 year mission. CALIPSO was launched into a 98 degree sun synchronous Earth orbit in April of 2006 to study clouds and aerosols and acquires over 5 gigabytes of data every 24 hours. The ground track of one CALIPSO orbit as well as high and low intensity South Atlantic Anomaly outlines is shown. CALIPSO passes through the SAA several times each day. Spaced based remote sensing systems that include multiple instruments and/or instruments such as lidar generate large volumes of data and require robust real-time hardware and software mechanisms and high throughput processors. Due to onboard storage restrictions and telemetry downlink limitations these systems must pre-process and reduce the data before sending it to the ground. This onboard processing and realtime requirement load may mean that newer more powerful processors are needed even though acceptable radiation-hardened versions have not yet been released. CALIPSO's single board computer payload controller processor is actually a set of four (4) voting non-radiation hardened COTS Power PC 603r's built on a single width VME card by General Dynamics Advanced Information Systems (GDAIS). Significant radiation concerns for CALIPSO and other Low Earth Orbit (LEO) satellites include the South Atlantic Anomaly (SAA), the north and south poles and strong solar events. Over much of South America and extending into the South Atlantic Ocean the Van Allen radiation belts dip to just 200-800km and spacecraft entering this area are subjected to high energy protons and experience higher than normal Single Event Upset (SEU) and Single Event Latch-up (SEL) rates. Although less significant, spacecraft flying in the area around the poles experience similar upsets. Finally, powerful solar proton events in the range of 10MeV/10pfu to 100MeV/1pfu as are forecasted and tracked by NOAA's Space Environment Center in Colorado can result in Single Event Upset (SEU), Single Event Latch-up (SEL) and permanent failures such as Single Event Gate Rupture (SEGR) in some technologies. (Galactic Cosmic Rays (GCRs) are another source, especially for gate rupture) CALIPSO mitigates common radiation concerns in its data handling through the use of redundant processors, radiation-hardened Application Specific Integrated Circuits (ASIC), hardware-based Error Detection and Correction (EDAC), processor and memory scrubbing, redundant boot code and mirrored files. After presenting a system overview this paper will expand on each of these strategies. Where applicable, related on-orbit data collected since the CALIPSO initial boot on May 4, 2006 will be noted.
NASA Space Engineering Research Center for VLSI systems design

NASA Technical Reports Server (NTRS)

1991-01-01

This annual review reports the center's activities and findings on very large scale integration (VLSI) systems design for 1990, including project status, financial support, publications, the NASA Space Engineering Research Center (SERC) Symposium on VLSI Design, research results, and outreach programs. Processor chips completed or under development are listed. Research results summarized include a design technique to harden complementary metal oxide semiconductors (CMOS) memory circuits against single event upset (SEU); improved circuit design procedures; and advances in computer aided design (CAD), communications, computer architectures, and reliability design. Also described is a high school teacher program that exposes teachers to the fundamentals of digital logic design.
Accelerating molecular dynamic simulation on the cell processor and Playstation 3.

PubMed

Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S

2009-01-30

Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase its power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.
Allocating application to group of consecutive processors in fault-tolerant deadlock-free routing path defined by routers obeying same rules for path selection

DOEpatents

Leung, Vitus J [Albuquerque, NM; Phillips, Cynthia A [Albuquerque, NM; Bender, Michael A [East Northport, NY; Bunde, David P [Urbana, IL

2009-07-21

In a multiple processor computing apparatus, directional routing restrictions and a logical channel construct permit fault tolerant, deadlock-free routing. Processor allocation can be performed by creating a linear ordering of the processors based on routing rules used for routing communications between the processors. The linear ordering can assume a loop configuration, and bin-packing is applied to this loop configuration. The interconnection of the processors can be conceptualized as a generally rectangular 3-dimensional grid, and the MC allocation algorithm is applied with respect to the 3-dimensional grid.
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE Office of Scientific and Technical Information (OSTI.GOV)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the World- wide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. This paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE PAGES

Childers, J. T.; Uram, T. D.; LeCompte, T. J.; ...

2016-09-29

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
Adapting the serial Alpgen parton-interaction generator to simulate LHC collisions on millions of parallel threads

DOE Office of Scientific and Technical Information (OSTI.GOV)

Childers, J. T.; Uram, T. D.; LeCompte, T. J.

As the LHC moves to higher energies and luminosity, the demand for computing resources increases accordingly and will soon outpace the growth of the Worldwide LHC Computing Grid. To meet this greater demand, event generation Monte Carlo was targeted for adaptation to run on Mira, the supercomputer at the Argonne Leadership Computing Facility. Alpgen is a Monte Carlo event generation application that is used by LHC experiments in the simulation of collisions that take place in the Large Hadron Collider. Finally, this paper details the process by which Alpgen was adapted from a single-processor serial-application to a large-scale parallel-application andmore » the performance that was achieved.« less
A Strassen-Newton algorithm for high-speed parallelizable matrix inversion

NASA Technical Reports Server (NTRS)

Bailey, David H.; Ferguson, Helaman R. P.

1988-01-01

Techniques are described for computing matrix inverses by algorithms that are highly suited to massively parallel computation. The techniques are based on an algorithm suggested by Strassen (1969). Variations of this scheme use matrix Newton iterations and other methods to improve the numerical stability while at the same time preserving a very high level of parallelism. One-processor Cray-2 implementations of these schemes range from one that is up to 55 percent faster than a conventional library routine to one that is slower than a library routine but achieves excellent numerical stability. The problem of computing the solution to a single set of linear equations is discussed, and it is shown that this problem can also be solved efficiently using these techniques.
The science of computing - Parallel computation

NASA Technical Reports Server (NTRS)

Denning, P. J.

1985-01-01

Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
Mass Memory Storage Devices for AN/SLQ-32(V).

DTIC Science & Technology

1985-06-01

tactical programs and libraries into the AN/UYK-19 computer , the RP-16 microprocessor, and other peripheral processors (e.g., ADLS and Band 1) will be...software must be loaded into computer memory from the 4-track magnetic tape cartridges (MTCs) on which the programs are stored. Program load begins...software. Future computer programs , which will reside in peripheral processors, include the Automated Decoy Launching System (ADLS) and Band 1. As
Computer Sciences and Data Systems, volume 2

NASA Technical Reports Server (NTRS)

1987-01-01

Topics addressed include: data storage; information network architecture; VHSIC technology; fiber optics; laser applications; distributed processing; spaceborne optical disk controller; massively parallel processors; and advanced digital SAR processors.
Efficiency of parallel direct optimization

NASA Technical Reports Server (NTRS)

Janies, D. A.; Wheeler, W. C.

2001-01-01

Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.
Non-Boolean computing with nanomagnets for computer vision applications

NASA Astrophysics Data System (ADS)

Bhanja, Sanjukta; Karunaratne, D. K.; Panchumarthy, Ravi; Rajaram, Srinath; Sarkar, Sudeep

2016-02-01

The field of nanomagnetism has recently attracted tremendous attention as it can potentially deliver low-power, high-speed and dense non-volatile memories. It is now possible to engineer the size, shape, spacing, orientation and composition of sub-100 nm magnetic structures. This has spurred the exploration of nanomagnets for unconventional computing paradigms. Here, we harness the energy-minimization nature of nanomagnetic systems to solve the quadratic optimization problems that arise in computer vision applications, which are computationally expensive. By exploiting the magnetization states of nanomagnetic disks as state representations of a vortex and single domain, we develop a magnetic Hamiltonian and implement it in a magnetic system that can identify the salient features of a given image with more than 85% true positive rate. These results show the potential of this alternative computing method to develop a magnetic coprocessor that might solve complex problems in fewer clock cycles than traditional processors.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

NASA Technical Reports Server (NTRS)

Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

2017-01-01

Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
Safe and Efficient Support for Embeded Multi-Processors in ADA

NASA Astrophysics Data System (ADS)

Ruiz, Jose F.

2010-08-01

New software demands increasing processing power, and multi-processor platforms are spreading as the answer to achieve the required performance. Embedded real-time systems are also subject to this trend, but in the case of real-time mission-critical systems, the properties of reliability, predictability and analyzability are also paramount. The Ada 2005 language defined a subset of its tasking model, the Ravenscar profile, that provides the basis for the implementation of deterministic and time analyzable applications on top of a streamlined run-time system. This Ravenscar tasking profile, originally designed for single processors, has proven remarkably useful for modelling verifiable real-time single-processor systems. This paper proposes a simple extension to the Ravenscar profile to support multi-processor systems using a fully partitioned approach. The implementation of this scheme is simple, and it can be used to develop applications amenable to schedulability analysis.

Conjugate-Gradient Algorithms For Dynamics Of Manipulators

NASA Technical Reports Server (NTRS)

Fijany, Amir; Scheid, Robert E.

1993-01-01

Algorithms for serial and parallel computation of forward dynamics of multiple-link robotic manipulators by conjugate-gradient method developed. Parallel algorithms have potential for speedup of computations on multiple linked, specialized processors implemented in very-large-scale integrated circuits. Such processors used to stimulate dynamics, possibly faster than in real time, for purposes of planning and control.
Word Processor versus "The Pencil" Effects on Writing.

ERIC Educational Resources Information Center

Schanck, Emily T.

In a study to determine the effects of writing on the computer versus traditional writing by hand, 22 fourth grade students were randomly assigned to samples using either a computer or paper and pencil. The study hypothesized that (1) children are not more willing to revise and improve their writing using a word processor when compared to the…
Closing the Gap: Cybersecurity for U.S. Forces and Commands

DTIC Science & Technology

2017-03-30

Dickson, Ph.D. Professor of Military Studies , JAWS Thesis Advisor Kevin Therrien, Col, USAF Committee Member Stephen Rogers, Colonel, USA Director...infrastructures, and includes the Internet, telecommunications networks, computer systems, and embedded processors and controllers in critical industries.”5...of information technology infrastructures, including the Internet, telecommunications networks, computer systems, and embedded processors and
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Sohn, Andrew

1996-01-01

Dynamic mesh adaptation on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load inbalances among processors on a parallel machine. This paper described the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution coast is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35 percent of the mesh is randomly adapted. For large scale scientific computations, our load balancing strategy gives an almost sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remappier yields processor assignments that are less than 3 percent of the optimal solutions, but requires only 1 percent of the computational time.
Recursive Hierarchical Image Segmentation by Region Growing and Constrained Spectral Clustering

NASA Technical Reports Server (NTRS)

Tilton, James C.

2002-01-01

This paper describes an algorithm for hierarchical image segmentation (referred to as HSEG) and its recursive formulation (referred to as RHSEG). The HSEG algorithm is a hybrid of region growing and constrained spectral clustering that produces a hierarchical set of image segmentations based on detected convergence points. In the main, HSEG employs the hierarchical stepwise optimization (HS WO) approach to region growing, which seeks to produce segmentations that are more optimized than those produced by more classic approaches to region growing. In addition, HSEG optionally interjects between HSWO region growing iterations merges between spatially non-adjacent regions (i.e., spectrally based merging or clustering) constrained by a threshold derived from the previous HSWO region growing iteration. While the addition of constrained spectral clustering improves the segmentation results, especially for larger images, it also significantly increases HSEG's computational requirements. To counteract this, a computationally efficient recursive, divide-and-conquer, implementation of HSEG (RHSEG) has been devised and is described herein. Included in this description is special code that is required to avoid processing artifacts caused by RHSEG s recursive subdivision of the image data. Implementations for single processor and for multiple processor computer systems are described. Results with Landsat TM data are included comparing HSEG with classic region growing. Finally, an application to image information mining and knowledge discovery is discussed.
Instrumentation, performance visualization, and debugging tools for multiprocessors

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Fineman, Charles E.; Hontalas, Philip J.

1991-01-01

The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging, and tuning parallel programs becomes intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs.
PREMAQ: A NEW PRE-PROCESSOR TO CMAQ FOR AIR-QUALITY FORECASTING

EPA Science Inventory

A new pre-processor to CMAQ (PREMAQ) has been developed as part of the national air-quality forecasting system. PREMAQ combines the functionality of MCIP and parts of SMOKE in a single real-time processor. PREMAQ was specifically designed to link NCEP's Eta model with CMAQ, and...
Using Nested Contractions and a Hierarchical Tensor Format To Compute Vibrational Spectra of Molecules with Seven Atoms.

PubMed

Thomas, Phillip S; Carrington, Tucker

2015-12-31

We propose a method for solving the vibrational Schrödinger equation with which one can compute hundreds of energy levels of seven-atom molecules using at most a few gigabytes of memory. It uses nested contractions in conjunction with the reduced-rank block power method (RRBPM) described in J. Chem. Phys. 2014, 140, 174111. Successive basis contractions are organized into a tree, the nodes of which are associated with eigenfunctions of reduced-dimension Hamiltonians. The RRBPM is used recursively to compute eigenfunctions of nodes in bases of products of reduced-dimension eigenfunctions of nodes with fewer coordinates. The corresponding vectors are tensors in what is called CP-format. The final wave functions are therefore represented in a hierarchical CP-format. Computational efficiency and accuracy are significantly improved by representing the Hamiltonian in the same hierarchical format as the wave function. We demonstrate that with this hierarchical RRBPM it is possible to compute energy levels of a 64-D coupled-oscillator model Hamiltonian and also of acetonitrile (CH3CN) and ethylene oxide (C2H4O), for which we use quartic potentials. The most accurate acetonitrile calculation uses 139 MB of memory and takes 3.2 h on a single processor. The most accurate ethylene oxide calculation uses 6.1 GB of memory and takes 14 d on 63 processors. The hierarchical RRBPM shatters the memory barrier that impedes the calculation of vibrational spectra.
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for four dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
System and method for representing and manipulating three-dimensional objects on massively parallel architectures

DOEpatents

Karasick, Michael S.; Strip, David R.

1996-01-01

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
SPAR improved structural-fluid dynamic analysis capability

NASA Technical Reports Server (NTRS)

Pearson, M. L.

1985-01-01

The results of a study whose objective was to improve the operation of the SPAR computer code by improving efficiency, user features, and documentation is presented. Additional capability was added to the SPAR arithmetic utility system, including trigonometric functions, numerical integration, interpolation, and matrix combinations. Improvements were made in the EIG processor. A processor was created to compute and store principal stresses in table-format data sets. An additional capability was developed and incorporated into the plot processor which permits plotting directly from table-format data sets. Documentation of all these features is provided in the form of updates to the SPAR users manual.
Accuracy requirements of optical linear algebra processors in adaptive optics imaging systems.

PubMed

Downie, J D; Goodman, J W

1989-10-15

A ground-based adaptive optics imaging telescope system attempts to improve image quality by measuring and correcting for atmospherically induced wavefront aberrations. The necessary control computations during each cycle will take a finite amount of time, which adds to the residual error variance since the atmosphere continues to change during that time. Thus an optical processor may be well-suited for this task. This paper investigates this possibility by studying the accuracy requirements in a general optical processor that will make it competitive with, or superior to, a conventional digital computer for adaptive optics use.
Development for SSV on a parallel processing system (PARAGON)

NASA Astrophysics Data System (ADS)

Gothard, Benny M.; Allmen, Mark; Carroll, Michael J.; Rich, Dan

1995-12-01

A goal of the surrogate semi-autonomous vehicle (SSV) program is to have multiple vehicles navigate autonomously and cooperatively with other vehicles. This paper describes the process and tools used in porting UGV/SSV (unmanned ground vehicle) autonomous mobility and target recognition algorithms from a SISD (single instruction single data) processor architecture (i.e., a Sun SPARC workstation running C/UNIX) to a MIMD (multiple instruction multiple data) parallel processor architecture (i.e., PARAGON-a parallel set of i860 processors running C/UNIX). It discusses the gains in performance and the pitfalls of such a venture. It also examines the merits of this processor architecture (based on this conceptual prototyping effort) and programming paradigm to meet the final SSV demonstration requirements.
Memory access in shared virtual memory

DOE Office of Scientific and Technical Information (OSTI.GOV)

Berrendorf, R.

1992-01-01

Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Memory access in shared virtual memory

DOE Office of Scientific and Technical Information (OSTI.GOV)

Berrendorf, R.

1992-09-01

Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
A 32-bit NMOS microprocessor with a large register file

NASA Astrophysics Data System (ADS)

Sherburne, R. W., Jr.; Katevenis, M. G. H.; Patterson, D. A.; Sequin, C. H.

1984-10-01

Two scaled versions of a 32-bit NMOS reduced instruction set computer CPU, called RISC II, have been implemented on two different processing lines using the simple Mead and Conway layout rules with lambda values of 2 and 1.5 microns (corresponding to drawn gate lengths of 4 and 3 microns), respectively. The design utilizes a small set of simple instructions in conjunction with a large register file in order to provide high performance. This approach has resulted in two surprisingly powerful single-chip processors.
Ordering of guarded and unguarded stores for no-sync I/O

DOEpatents

Gara, Alan; Ohmacht, Martin

2013-06-25

A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. The third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushing only the first queue.
Treecode with a Special-Purpose Processor

NASA Astrophysics Data System (ADS)

Makino, Junichiro

1991-08-01

We describe an implementation of the modified Barnes-Hut tree algorithm for a gravitational N-body calculation on a GRAPE (GRAvity PipE) backend processor. GRAPE is a special-purpose computer for N-body calculations. It receives the positions and masses of particles from a host computer and then calculates the gravitational force at each coordinate specified by the host. To use this GRAPE processor with the hierarchical tree algorithm, the host computer must maintain a list of all nodes that exert force on a particle. If we create this list for each particle of the system at each timestep, the number of floating-point operations on the host and that on GRAPE would become comparable, and the increased speed obtained by using GRAPE would be small. In our modified algorithm, we create a list of nodes for many particles. Thus, the amount of the work required of the host is significantly reduced. This algorithm was originally developed by Barnes in order to vectorize the force calculation on a Cyber 205. With this algorithm, the computing time of the force calculation becomes comparable to that of the tree construction, if the GRAPE backend processor is sufficiently fast. The obtained speed-up factor is 30 to 50 for a RISC-based host computer and GRAPE-1A with a peak speed of 240 Mflops.
Development of hardware accelerator for molecular dynamics simulations: a computation board that calculates nonbonded interactions in cooperation with fast multipole method.

PubMed

Amisaki, Takashi; Toyoda, Shinjiro; Miyagawa, Hiroh; Kitamura, Kunihiro

2003-04-15

Evaluation of long-range Coulombic interactions still represents a bottleneck in the molecular dynamics (MD) simulations of biological macromolecules. Despite the advent of sophisticated fast algorithms, such as the fast multipole method (FMM), accurate simulations still demand a great amount of computation time due to the accuracy/speed trade-off inherently involved in these algorithms. Unless higher order multipole expansions, which are extremely expensive to evaluate, are employed, a large amount of the execution time is still spent in directly calculating particle-particle interactions within the nearby region of each particle. To reduce this execution time for pair interactions, we developed a computation unit (board), called MD-Engine II, that calculates nonbonded pairwise interactions using a specially designed hardware. Four custom arithmetic-processors and a processor for memory manipulation ("particle processor") are mounted on the computation board. The arithmetic processors are responsible for calculation of the pair interactions. The particle processor plays a central role in realizing efficient cooperation with the FMM. The results of a series of 50-ps MD simulations of a protein-water system (50,764 atoms) indicated that a more stringent setting of accuracy in FMM computation, compared with those previously reported, was required for accurate simulations over long time periods. Such a level of accuracy was efficiently achieved using the cooperative calculations of the FMM and MD-Engine II. On an Alpha 21264 PC, the FMM computation at a moderate but tolerable level of accuracy was accelerated by a factor of 16.0 using three boards. At a high level of accuracy, the cooperative calculation achieved a 22.7-fold acceleration over the corresponding conventional FMM calculation. In the cooperative calculations of the FMM and MD-Engine II, it was possible to achieve more accurate computation at a comparable execution time by incorporating larger nearby regions. Copyright 2003 Wiley Periodicals, Inc. J Comput Chem 24: 582-592, 2003
GPU-based Parallel Application Design for Emerging Mobile Devices

NASA Astrophysics Data System (ADS)

Gupta, Kshitij

A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).

ARC-2012-ACD12-0020-005

NASA Image and Video Library

2012-02-10

Then and Now: These images illustrate the dramatic improvement in NASA computing power over the last 23 years, and its effect on the number of grid points used for flow simulations. At left, an image from the first full-body Navier-Stokes simulation (1988) of an F-16 fighter jet showing pressure on the aircraft body, and fore-body streamlines at Mach 0.90. This steady-state solution took 25 hours using a single Cray X-MP processor to solve the 500,000 grid-point problem. Investigator: Neal Chaderjian, NASA Ames Research Center At right, a 2011 snapshot from a Navier-Stokes simulation of a V-22 Osprey rotorcraft in hover. The blade vortices interact with the smaller turbulent structures. This very detailed simulation used 660 million grid points, and ran on 1536 processors on the Pleiades supercomputer for 180 hours. Investigator: Neal Chaderjian, NASA Ames Research Center; Image: Tim Sandstrom, NASA Ames Research Center
An Evaluation of an Ada Implementation of the Rete Algorithm for Embedded Flight Processors

DTIC Science & Technology

1990-12-01

computers was desired. The VAX VMS operating system has many built-in methods for determining program performance (including VAX PCA), but these methods... overviev , of the target environment-- the MIL-STD-1750A VHSIC Avionic Modular Processor ( VA.IP, running under the Ada Avionics Real-Time Software (AARTS... computers . Mil-STD-1750A, the Air Force’s standard flight computer architecture, however, places severe constraints on applications software processing
Towards the simulation of molecular collisions with a superconducting quantum computer

NASA Astrophysics Data System (ADS)

Geller, Michael

2013-05-01

I will discuss the prospects for the use of large-scale, error-corrected quantum computers to simulate complex quantum dynamics such as molecular collisions. This will likely require millions qubits. I will also discuss an alternative approach [M. R. Geller et al., arXiv:1210.5260] that is ideally suited for today's superconducting circuits, which uses the single-excitation subspace (SES) of a system of n tunably coupled qubits. The SES method allows many operations in the unitary group SU(n) to be implemented in a single step, bypassing the need for elementary gates, thereby making large computations possible without error correction. The method enables universal quantum simulation, including simulation of the time-dependent Schrodinger equation, and we argue that a 1000-qubit SES processor should be capable of achieving quantum speedup relative to a petaflop supercomputer. We speculate on the utility and practicality of such a simulator for atomic and molecular collision physics. Work supported by the US National Science Foundation CDI program.
Communications systems and methods for subsea processors

DOEpatents

Gutierrez, Jose; Pereira, Luis

2016-04-26

A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
Efficient Interconnection Schemes for VLSI and Parallel Computation

DTIC Science & Technology

1989-08-01

Definition: Let R be a routing network. A set S of wires in R is a (directed) cut if it partitions the network into two sets of processors A and B ...such that every path from a processor in A to a processor in B contains a wire in S. The capacity cap(S) is the number of wires in the cut. For a set of...messages M, define the load load(M, S) of M on a cut S to be the number of messages in M from a processor in A to a processor in B . The load factor
Cloud Quantum Computing of an Atomic Nucleus

NASA Astrophysics Data System (ADS)

Dumitrescu, E. F.; McCaskey, A. J.; Hagen, G.; Jansen, G. R.; Morris, T. D.; Papenbrock, T.; Pooser, R. C.; Dean, D. J.; Lougovski, P.

2018-05-01

We report a quantum simulation of the deuteron binding energy on quantum processors accessed via cloud servers. We use a Hamiltonian from pionless effective field theory at leading order. We design a low-depth version of the unitary coupled-cluster ansatz, use the variational quantum eigensolver algorithm, and compute the binding energy to within a few percent. Our work is the first step towards scalable nuclear structure computations on a quantum processor via the cloud, and it sheds light on how to map scientific computing applications onto nascent quantum devices.
Cloud Quantum Computing of an Atomic Nucleus

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dumitrescu, Eugene F.; McCaskey, Alex J.; Hagen, Gaute

Here, we report a quantum simulation of the deuteron binding energy on quantum processors accessed via cloud servers. We use a Hamiltonian from pionless effective field theory at leading order. We design a low-depth version of the unitary coupled-cluster ansatz, use the variational quantum eigensolver algorithm, and compute the binding energy to within a few percent. Our work is the first step towards scalable nuclear structure computations on a quantum processor via the cloud, and it sheds light on how to map scientific computing applications onto nascent quantum devices.
Cloud Quantum Computing of an Atomic Nucleus.

PubMed

Dumitrescu, E F; McCaskey, A J; Hagen, G; Jansen, G R; Morris, T D; Papenbrock, T; Pooser, R C; Dean, D J; Lougovski, P

2018-05-25

We report a quantum simulation of the deuteron binding energy on quantum processors accessed via cloud servers. We use a Hamiltonian from pionless effective field theory at leading order. We design a low-depth version of the unitary coupled-cluster ansatz, use the variational quantum eigensolver algorithm, and compute the binding energy to within a few percent. Our work is the first step towards scalable nuclear structure computations on a quantum processor via the cloud, and it sheds light on how to map scientific computing applications onto nascent quantum devices.
Computing NLTE Opacities -- Node Level Parallel Calculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Holladay, Daniel

Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Cloud Quantum Computing of an Atomic Nucleus

DOE PAGES

Dumitrescu, Eugene F.; McCaskey, Alex J.; Hagen, Gaute; ...

2018-05-23

Here, we report a quantum simulation of the deuteron binding energy on quantum processors accessed via cloud servers. We use a Hamiltonian from pionless effective field theory at leading order. We design a low-depth version of the unitary coupled-cluster ansatz, use the variational quantum eigensolver algorithm, and compute the binding energy to within a few percent. Our work is the first step towards scalable nuclear structure computations on a quantum processor via the cloud, and it sheds light on how to map scientific computing applications onto nascent quantum devices.
PCI-based WILDFIRE reconfigurable computing engines

NASA Astrophysics Data System (ADS)

Fross, Bradley K.; Donaldson, Robert L.; Palmer, Douglas J.

1996-10-01

WILDFORCE is the first PCI-based custom reconfigurable computer that is based on the Splash 2 technology transferred from the National Security Agency and the Institute for Defense Analyses, Supercomputing Research Center (SRC). The WILDFORCE architecture has many of the features of the WILDFIRE computer, such as field- programmable gate array (FPGA) based processing elements, linear array and crossbar interconnection, and high- performance memory and I/O subsystems. New features introduced in the PCI-based WILDFIRE systems include memory/processor options that can be added to any processing element. These options include static and dynamic memory, digital signal processors (DSPs), FPGAs, and microprocessors. In addition to memory/processor options, many different application specific connectors can be used to extend the I/O capabilities of the system, including systolic I/O, camera input and video display output. This paper also discusses how this new PCI-based reconfigurable computing engine is used for rapid-prototyping, real-time video processing and other DSP applications.
Accelerating Climate Simulations Through Hybrid Computing

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark

2009-01-01

Unconventional multi-core processors (e.g., IBM Cell B/E and NYIDIDA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) identical MPI implementation is required in both systems, and; (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors, two IBM QS22 Cell blades, connected with Infiniband), allowing for seamlessly offloading compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approx.10% network overhead.
JPRS Report, Science & Technology, USSR: Computers, Control Systems and Machines

DTIC Science & Technology

1989-03-14

optimizatsii slozhnykh sistem (Coding Theory and Complex System Optimization ). Alma-Ata, Nauka Press, 1977, pp. 8-16. 11. Author’s certificate number...Interpreter Specifics [0. I. Amvrosova] ............................................. 141 Creation of Modern Computer Systems for Complex Ecological...processor can be designed to decrease degradation upon failure and assure more reliable processor operation, without requiring more complex software or
Dynamic load balance scheme for the DSMC algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Jin; Geng, Xiangren; Jiang, Dingwu

The direct simulation Monte Carlo (DSMC) algorithm, devised by Bird, has been used over a wide range of various rarified flow problems in the past 40 years. While the DSMC is suitable for the parallel implementation on powerful multi-processor architecture, it also introduces a large load imbalance across the processor array, even for small examples. The load imposed on a processor by a DSMC calculation is determined to a large extent by the total of simulator particles upon it. Since most flows are impulsively started with initial distribution of particles which is surely quite different from the steady state, themore » total of simulator particles will change dramatically. The load balance based upon an initial distribution of particles will break down as the steady state of flow is reached. The load imbalance and huge computational cost of DSMC has limited its application to rarefied or simple transitional flows. In this paper, by taking advantage of METIS, a software for partitioning unstructured graphs, and taking the total of simulator particles in each cell as a weight information, the repartitioning based upon the principle that each processor handles approximately the equal total of simulator particles has been achieved. The computation must pause several times to renew the total of simulator particles in each processor and repartition the whole domain again. Thus the load balance across the processors array holds in the duration of computation. The parallel efficiency can be improved effectively. The benchmark solution of a cylinder submerged in hypersonic flow has been simulated numerically. Besides, hypersonic flow past around a complex wing-body configuration has also been simulated. The results have displayed that, for both of cases, the computational time can be reduced by about 50%.« less
Hardware multiplier processor

DOEpatents

Pierce, Paul E.

1986-01-01

A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
Hardware multiplier processor

DOEpatents

Pierce, P.E.

A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
The AIS-5000 parallel processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schmitt, L.A.; Wilson, S.S.

1988-05-01

The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has superior cost/performance characteristics than two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In thismore » paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance bench marks are given for certain simple and complex functions.« less
Interprocessor bus switching system for simultaneous communication in plural bus parallel processing system

DOEpatents

Atac, R.; Fischler, M.S.; Husby, D.E.

1991-01-15

A bus switching apparatus and method for multiple processor computer systems comprises a plurality of bus switches interconnected by branch buses. Each processor or other module of the system is connected to a spigot of a bus switch. Each bus switch also serves as part of a backplane of a modular crate hardware package. A processor initiates communication with another processor by identifying that other processor. The bus switch to which the initiating processor is connected identifies and secures, if possible, a path to that other processor, either directly or via one or more other bus switches which operate similarly. If a particular desired path through a given bus switch is not available to be used, an alternate path is considered, identified and secured. 11 figures.
Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

DOEpatents

Chatterjee, Siddhartha [Yorktown Heights, NY; Gunnels, John A [Brewster, NY

2011-11-08

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
Interprocessor bus switching system for simultaneous communication in plural bus parallel processing system

DOEpatents

Atac, Robert; Fischler, Mark S.; Husby, Donald E.

1991-01-01

A bus switching apparatus and method for multiple processor computer systems comprises a plurality of bus switches interconnected by branch buses. Each processor or other module of the system is connected to a spigot of a bus switch. Each bus switch also serves as part of a backplane of a modular crate hardware package. A processor initiates communication with another processor by identifying that other processor. The bus switch to which the initiating processor is connected identifies and secures, if possible, a path to that other processor, either directly or via one or more other bus switches which operate similarly. If a particular desired path through a given bus switch is not available to be used, an alternate path is considered, identified and secured.

System Level RBDO for Military Ground Vehicles using High Performance Computing

DTIC Science & Technology

2008-01-01

platform. Only the analyses that required more than 24 processors were conducted on the Onyx 350 due to the limited number of processors on the...optimization constraints varied. The queues set the number of processors and number of finite element code licenses available to the analyses. sgi ONYX ...3900: unix 24 MIPS R16000 PROCESSORS 4 IR2 GRAPHICS PIPES 4 IR3 GRAPHICS PIPES 24 GBYTES MEMORY 36 GBYTES LOCAL DISK SPACE sgi ONYX 350: unix 32 MIPS
A data base processor semantics specification package

NASA Technical Reports Server (NTRS)

Fishwick, P. A.

1983-01-01

A Semantics Specification Package (DBPSSP) for the Intel Data Base Processor (DBP) is defined. DBPSSP serves as a collection of cross assembly tools that allow the analyst to assemble request blocks on the host computer for passage to the DBP. The assembly tools discussed in this report may be effectively used in conjunction with a DBP compatible data communications protocol to form a query processor, precompiler, or file management system for the database processor. The source modules representing the components of DBPSSP are fully commented and included.
Parallel, stochastic measurement of molecular surface area.

PubMed

Juba, Derek; Varshney, Amitabh

2008-08-01

Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
Impact of Load Balancing on Unstructured Adaptive Grid Computations for Distributed-Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Simon, Horst D.

1996-01-01

The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
Multi-level Hierarchical Poly Tree computer architectures

NASA Technical Reports Server (NTRS)

Padovan, Joe; Gute, Doug

1990-01-01

Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
An Application-Based Performance Characterization of the Columbia Supercluster

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Hoaqiang; Kiris, Cetin; Saini, Subhash

2005-01-01

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.
FFT Computation with Systolic Arrays, A New Architecture

NASA Technical Reports Server (NTRS)

Boriakoff, Valentin

1994-01-01

The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Chen, Yousu; Wu, Di

2015-12-09

Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Messagemore » Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.« less
Parallelising a molecular dynamics algorithm on a multi-processor workstation

NASA Astrophysics Data System (ADS)

Müller-Plathe, Florian

1990-12-01

The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-08-12

A data flow computer and method of computing are disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1988-07-22

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ''fire'' signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor G.

1995-01-01

A data flow computer which of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-10-07

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Optimizing Performance of Combustion Chemistry Solvers on Intel's Many Integrated Core (MIC) Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sitaraman, Hariswaran; Grout, Ray W

This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved heremore » through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Reed, D.A.; Grunwald, D.C.

The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of thee processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At themore » other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodmen attacking the log with chainsaws.« less
Implementation of kernels on the Maestro processor

NASA Astrophysics Data System (ADS)

Suh, Jinwoo; Kang, D. I. D.; Crago, S. P.

Currently, most microprocessors use multiple cores to increase performance while limiting power usage. Some processors use not just a few cores, but tens of cores or even 100 cores. One such many-core microprocessor is the Maestro processor, which is based on Tilera's TILE64 processor. The Maestro chip is a 49-core, general-purpose, radiation-hardened processor designed for space applications. The Maestro processor, unlike the TILE64, has a floating point unit (FPU) in each core for improved floating point performance. The Maestro processor runs at 342 MHz clock frequency. On the Maestro processor, we implemented several widely used kernels: matrix multiplication, vector add, FIR filter, and FFT. We measured and analyzed the performance of these kernels. The achieved performance was up to 5.7 GFLOPS, and the speedup compared to single tile was up to 49 using 49 tiles.
Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2015-04-01

PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load unbalancing at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
Comparing the Performance of Two Dynamic Load Distribution Methods

NASA Technical Reports Server (NTRS)

Kale, L. V.

1987-01-01

Parallel processing of symbolic computations on a message-passing multi-processor presents one challenge: To effectively utilize the available processors, the load must be distributed uniformly to all the processors. However, the structure of these computations cannot be predicted in advance. go, static scheduling methods are not applicable. In this paper, we compare the performance of two dynamic, distributed load balancing methods with extensive simulation studies. The two schemes are: the Contracting Within a Neighborhood (CWN) scheme proposed by us, and the Gradient Model proposed by Lin and Keller. We conclude that although simpler, the CWN is significantly more effective at distributing the work than the Gradient model.

Fault-tolerant battery system employing intra-battery network architecture

DOEpatents

Hagen, Ronald A.; Chen, Kenneth W.; Comte, Christophe; Knudson, Orlin B.; Rouillard, Jean

2000-01-01

A distributed energy storing system employing a communications network is disclosed. A distributed battery system includes a number of energy storing modules, each of which includes a processor and communications interface. In a network mode of operation, a battery computer communicates with each of the module processors over an intra-battery network and cooperates with individual module processors to coordinate module monitoring and control operations. The battery computer monitors a number of battery and module conditions, including the potential and current state of the battery and individual modules, and the conditions of the battery's thermal management system. An over-discharge protection system, equalization adjustment system, and communications system are also controlled by the battery computer. The battery computer logs and reports various status data on battery level conditions which may be reported to a separate system platform computer. A module transitions to a stand-alone mode of operation if the module detects an absence of communication connectivity with the battery computer. A module which operates in a stand-alone mode performs various monitoring and control functions locally within the module to ensure safe and continued operation.
Novel Highly Parallel and Systolic Architectures Using Quantum Dot-Based Hardware

NASA Technical Reports Server (NTRS)

Fijany, Amir; Toomarian, Benny N.; Spotnitz, Matthew

1997-01-01

VLSI technology has made possible the integration of massive number of components (processors, memory, etc.) into a single chip. In VLSI design, memory and processing power are relatively cheap and the main emphasis of the design is on reducing the overall interconnection complexity since data routing costs dominate the power, time, and area required to implement a computation. Communication is costly because wires occupy the most space on a circuit and it can also degrade clock time. In fact, much of the complexity (and hence the cost) of VLSI design results from minimization of data routing. The main difficulty in VLSI routing is due to the fact that crossing of the lines carrying data, instruction, control, etc. is not possible in a plane. Thus, in order to meet this constraint, the VLSI design aims at keeping the architecture highly regular with local and short interconnection. As a result, while the high level of integration has opened the way for massively parallel computation, practical and full exploitation of such a capability in many applications of interest has been hindered by the constraints on interconnection pattern. More precisely. the use of only localized communication significantly simplifies the design of interconnection architecture but at the expense of somewhat restricted class of applications. For example, there are currently commercially available products integrating; hundreds of simple processor elements within a single chip. However, the lack of adequate interconnection pattern among these processing elements make them inefficient for exploiting a large degree of parallelism in many applications.
Accelerating Wright–Fisher Forward Simulations on the Graphics Processing Unit

PubMed Central

Lawrie, David S.

2017-01-01

Forward Wright–Fisher simulations are powerful in their ability to model complex demography and selection scenarios, but suffer from slow execution on the Central Processor Unit (CPU), thus limiting their usefulness. However, the single-locus Wright–Fisher forward algorithm is exceedingly parallelizable, with many steps that are so-called “embarrassingly parallel,” consisting of a vast number of individual computations that are all independent of each other and thus capable of being performed concurrently. The rise of modern Graphics Processing Units (GPUs) and programming languages designed to leverage the inherent parallel nature of these processors have allowed researchers to dramatically speed up many programs that have such high arithmetic intensity and intrinsic concurrency. The presented GPU Optimized Wright–Fisher simulation, or “GO Fish” for short, can be used to simulate arbitrary selection and demographic scenarios while running over 250-fold faster than its serial counterpart on the CPU. Even modest GPU hardware can achieve an impressive speedup of over two orders of magnitude. With simulations so accelerated, one can not only do quick parametric bootstrapping of previously estimated parameters, but also use simulated results to calculate the likelihoods and summary statistics of demographic and selection models against real polymorphism data, all without restricting the demographic and selection scenarios that can be modeled or requiring approximations to the single-locus forward algorithm for efficiency. Further, as many of the parallel programming techniques used in this simulation can be applied to other computationally intensive algorithms important in population genetics, GO Fish serves as an exciting template for future research into accelerating computation in evolution. GO Fish is part of the Parallel PopGen Package available at: http://dl42.github.io/ParallelPopGen/. PMID:28768689
Document Image Parsing and Understanding using Neuromorphic Architecture

DTIC Science & Technology

2015-03-01

processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored to reduce the processing...developed to reduce the processing speed at different layers. In the pattern matching layer, the computing power of multicore processors is explored... cortex where the complex data is reduced to abstract representations. The abstract representation is compared to stored patterns in massively parallel
Resiliency in Future Cyber Combat

DTIC Science & Technology

2016-04-04

including the Internet , telecommunications networks, computer systems, and embed- ded processors and controllers.”6 One important point emerging from the...definition is that while the Internet is part of cyberspace, it is not all of cyberspace. Any computer processor capable of communicating with a...central proces- sor on a modern car are all part of cyberspace, although only some of them are routinely connected to the Internet . Most modern
Green Secure Processors: Towards Power-Efficient Secure Processor Design

NASA Astrophysics Data System (ADS)

Chhabra, Siddhartha; Solihin, Yan

With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.
Neurovision processor for designing intelligent sensors

NASA Astrophysics Data System (ADS)

Gupta, Madan M.; Knopf, George K.

1992-03-01

A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.
RTSPM: real-time Linux control software for scanning probe microscopy.

PubMed

Chandrasekhar, V; Mehta, M M

2013-01-01

Real time computer control is an essential feature of scanning probe microscopes, which have become important tools for the characterization and investigation of nanometer scale samples. Most commercial (and some open-source) scanning probe data acquisition software uses digital signal processors to handle the real time data processing and control, which adds to the expense and complexity of the control software. We describe here scan control software that uses a single computer and a data acquisition card to acquire scan data. The computer runs an open-source real time Linux kernel, which permits fast acquisition and control while maintaining a responsive graphical user interface. Images from a simulated tuning-fork based microscope as well as a standard topographical sample are also presented, showing some of the capabilities of the software.
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors

NASA Astrophysics Data System (ADS)

Yi, Hongsuk

2014-03-01

Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
System and method for representing and manipulating three-dimensional objects on massively parallel architectures

DOEpatents

Karasick, M.S.; Strip, D.R.

1996-01-30

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.
CMS Readiness for Multi-Core Workload Scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides amore » solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.« less
CMS readiness for multi-core workload scheduling

NASA Astrophysics Data System (ADS)

Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.

2017-10-01

In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
AltiVec performance increases for autonomous robotics for the MARSSCAPE architecture program

NASA Astrophysics Data System (ADS)

Gothard, Benny M.

2002-02-01

One of the main tall poles that must be overcome to develop a fully autonomous vehicle is the inability of the computer to understand its surrounding environment to a level that is required for the intended task. The military mission scenario requires a robot to interact in a complex, unstructured, dynamic environment. Reference A High Fidelity Multi-Sensor Scene Understanding System for Autonomous Navigation The Mobile Autonomous Robot Software Self Composing Adaptive Programming Environment (MarsScape) perception research addresses three aspects of the problem; sensor system design, processing architectures, and algorithm enhancements. A prototype perception system has been demonstrated on robotic High Mobility Multi-purpose Wheeled Vehicle and All Terrain Vehicle testbeds. This paper addresses the tall pole of processing requirements and the performance improvements based on the selected MarsScape Processing Architecture. The processor chosen is the Motorola Altivec-G4 Power PC(PPC) (1998 Motorola, Inc.), a highly parallized commercial Single Instruction Multiple Data processor. Both derived perception benchmarks and actual perception subsystems code will be benchmarked and compared against previous Demo II-Semi-autonomous Surrogate Vehicle processing architectures along with desktop Personal Computers(PC). Performance gains are highlighted with progress to date, and lessons learned and future directions are described.
Advanced Avionics and Processor Systems for a Flexible Space Exploration Architecture

NASA Technical Reports Server (NTRS)

Keys, Andrew S.; Adams, James H.; Smith, Leigh M.; Johnson, Michael A.; Cressler, John D.

2010-01-01

The Advanced Avionics and Processor Systems (AAPS) project, formerly known as the Radiation Hardened Electronics for Space Environments (RHESE) project, endeavors to develop advanced avionic and processor technologies anticipated to be used by NASA s currently evolving space exploration architectures. The AAPS project is a part of the Exploration Technology Development Program, which funds an entire suite of technologies that are aimed at enabling NASA s ability to explore beyond low earth orbit. NASA s Marshall Space Flight Center (MSFC) manages the AAPS project. AAPS uses a broad-scoped approach to developing avionic and processor systems. Investment areas include advanced electronic designs and technologies capable of providing environmental hardness, reconfigurable computing techniques, software tools for radiation effects assessment, and radiation environment modeling tools. Near-term emphasis within the multiple AAPS tasks focuses on developing prototype components using semiconductor processes and materials (such as Silicon-Germanium (SiGe)) to enhance a device s tolerance to radiation events and low temperature environments. As the SiGe technology will culminate in a delivered prototype this fiscal year, the project emphasis shifts its focus to developing low-power, high efficiency total processor hardening techniques. In addition to processor development, the project endeavors to demonstrate techniques applicable to reconfigurable computing and partially reconfigurable Field Programmable Gate Arrays (FPGAs). This capability enables avionic architectures the ability to develop FPGA-based, radiation tolerant processor boards that can serve in multiple physical locations throughout the spacecraft and perform multiple functions during the course of the mission. The individual tasks that comprise AAPS are diverse, yet united in the common endeavor to develop electronics capable of operating within the harsh environment of space. Specifically, the AAPS tasks for the Federal fiscal year of 2010 are: Silicon-Germanium (SiGe) Integrated Electronics for Extreme Environments, Modeling of Radiation Effects on Electronics, Radiation Hardened High Performance Processors (HPP), and and Reconfigurable Computing.
Flight Computer Design for the Space Technology 5 (ST-5) Mission

NASA Technical Reports Server (NTRS)

Speer, David; Jackson, George; Raphael, Dave; Day, John H. (Technical Monitor)

2001-01-01

As part of NASA's New Millennium Program, the Space Technology 5 mission will validate a variety of technologies for nano-satellite and constellation mission applications. Included are: a miniaturized and low power X-band transponder, a constellation communication and navigation transceiver, a cold gas micro-thruster, two different variable emittance (thermal) controllers, flex cables for solar array power collection, autonomous groundbased constellation management tools, and a new CMOS ultra low-power, radiation-tolerant, +0.5 volt logic technology. The ST-5 focus is on small and low-power. A single-processor, multi-function flight computer will implement direct digital and analog interfaces to all of the other spacecraft subsystems and components. There will not be a distributed data system that uses a standardized serial bus such as MIL-STD-1553 or MIL-STD-1773. The flight software running on the single processor will be responsible for all real-time processing associated with: guidance, navigation and control, command and data handling (C&DH) including uplink/downlink, power switching and battery charge management, science data analysis and storage, intra-constellation communications, and housekeeping data collection and logging. As a nanosatellite trail-blazer for future constellations of up to 100 separate space vehicles, ST-5 will demonstrate a compact (single board), low power (5.5 watts) solution to the data acquisition, control, communications, processing and storage requirements that have traditionally required an entire network of separate circuit boards and/or avionics boxes. In addition to the New Millennium technologies, other major spacecraft subsystems include the power system electronics, a lithium-ion battery, triple-junction solar cell arrays, a science-grade magnetometer, a miniature spinning sun sensor, and a propulsion system.
Increasing processor utilization during parallel computation rundown

NASA Technical Reports Server (NTRS)

Jones, W. H.

1986-01-01

Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.
Development of a new signal processor for tetralateral position sensitive detector based on single-chip microcomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huang Meizhen; Shi Longzhao; Wang Yuxing

2006-08-15

An inherently nonlinear relation between the output current of the tetralateral position sensitive detector (PSD) and the position of the incident light spot has been found theoretically. Based on single-chip microcomputer and the theoretical relation between output current and position, a new signal processor capable of correcting nonlinearity and reducing position measurement deviation of tetralateral PSD was developed. A tetralateral PSD (S1200, 13x13 mm{sup 2}, Hamamatsu Photonics K.K.) was measured with the new signal processor, a linear relation between the output position of the PSD, and the incident position of the light spot was obtained. In the 60% range ofmore » a 13x13 mm{sup 2} active area, the position nonlinearity (rms) was 0.15% and the position measurement deviation (rms) was {+-}20 {mu}m. Compared with traditional analog signal processor, the new signal processor is of better compatibility, lower cost, higher precision, and easier to be interfaced.« less
Development of a new signal processor for tetralateral position sensitive detector based on single-chip microcomputer

NASA Astrophysics Data System (ADS)

Huang, Mei-Zhen; Shi, Long-Zhao; Wang, Yu-Xing; Ni, Yi; Li, Zhen-Qing; Ding, Hai-Feng

2006-08-01

An inherently nonlinear relation between the output current of the tetralateral position sensitive detector (PSD) and the position of the incident light spot has been found theoretically. Based on single-chip microcomputer and the theoretical relation between output current and position, a new signal processor capable of correcting nonlinearity and reducing position measurement deviation of tetralateral PSD was developed. A tetralateral PSD (S1200, 13×13mm2, Hamamatsu Photonics K.K.) was measured with the new signal processor, a linear relation between the output position of the PSD, and the incident position of the light spot was obtained. In the 60% range of a 13×13mm2 active area, the position nonlinearity (rms) was 0.15% and the position measurement deviation (rms) was ±20μm. Compared with traditional analog signal processor, the new signal processor is of better compatibility, lower cost, higher precision, and easier to be interfaced.
The scaling issue: scientific opportunities

NASA Astrophysics Data System (ADS)

Orbach, Raymond L.

2009-07-01

A brief history of the Leadership Computing Facility (LCF) initiative is presented, along with the importance of SciDAC to the initiative. The initiative led to the initiation of the Innovative and Novel Computational Impact on Theory and Experiment program (INCITE), open to all researchers in the US and abroad, and based solely on scientific merit through peer review, awarding sizeable allocations (typically millions of processor-hours per project). The development of the nation's LCFs has enabled available INCITE processor-hours to double roughly every eight months since its inception in 2004. The 'top ten' LCF accomplishments in 2009 illustrate the breadth of the scientific program, while the 75 million processor hours allocated to American business since 2006 highlight INCITE contributions to US competitiveness. The extrapolation of INCITE processor hours into the future brings new possibilities for many 'classic' scaling problems. Complex systems and atomic displacements to cracks are but two examples. However, even with increasing computational speeds, the development of theory, numerical representations, algorithms, and efficient implementation are required for substantial success, exhibiting the crucial role that SciDAC will play.
a Real-Time Computer Music Synthesis System

NASA Astrophysics Data System (ADS)

Lent, Keith Henry

A real time sound synthesis system has been developed at the Computer Music Center of The University of Texas at Austin. This system consists of several stand alone processors that were constructed jointly with White Instruments in Austin. These processors can be programmed as general purpose computers, but are provided with a number of specialized interfaces including: MIDI, 8 bit parallel, high speed serial, 2 channels analog input (18 bit A/Ds, 48kHz sample rate), and 4 channels analog output (18 bit D/As). In addition, a basic music synthesis language (Music56000) has been written in assembly code. On top of this, a symbolic compiler (PatchWork) has been developed to enable algorithms which run in these processors to be created graphically. And finally, a number of efficient time domain numerical models have been developed to enable the construction, simulation, control, and synthesis of many musical acoustics systems in real time on these processors. Specifically, assembly language models for cylindrical and conical horn sections, dissipative losses, tone holes, bells, and a number of linear and nonlinear boundary conditions have been developed.

3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

1997-12-31

The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle.more » The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.« less
Assessing the Progress of Trapped-Ion Processors Towards Fault-Tolerant Quantum Computation

NASA Astrophysics Data System (ADS)

Bermudez, A.; Xu, X.; Nigmatullin, R.; O'Gorman, J.; Negnevitsky, V.; Schindler, P.; Monz, T.; Poschinger, U. G.; Hempel, C.; Home, J.; Schmidt-Kaler, F.; Biercuk, M.; Blatt, R.; Benjamin, S.; Müller, M.

2017-10-01

A quantitative assessment of the progress of small prototype quantum processors towards fault-tolerant quantum computation is a problem of current interest in experimental and theoretical quantum information science. We introduce a necessary and fair criterion for quantum error correction (QEC), which must be achieved in the development of these quantum processors before their sizes are sufficiently big to consider the well-known QEC threshold. We apply this criterion to benchmark the ongoing effort in implementing QEC with topological color codes using trapped-ion quantum processors and, more importantly, to guide the future hardware developments that will be required in order to demonstrate beneficial QEC with small topological quantum codes. In doing so, we present a thorough description of a realistic trapped-ion toolbox for QEC and a physically motivated error model that goes beyond standard simplifications in the QEC literature. We focus on laser-based quantum gates realized in two-species trapped-ion crystals in high-optical aperture segmented traps. Our large-scale numerical analysis shows that, with the foreseen technological improvements described here, this platform is a very promising candidate for fault-tolerant quantum computation.
Strong coupling of a single electron in silicon to a microwave photon

NASA Astrophysics Data System (ADS)

Mi, X.; Cady, J. V.; Zajac, D. M.; Deelman, P. W.; Petta, J. R.

2017-01-01

Silicon is vital to the computing industry because of the high quality of its native oxide and well-established doping technologies. Isotopic purification has enabled quantum coherence times on the order of seconds, thereby placing silicon at the forefront of efforts to create a solid-state quantum processor. We demonstrate strong coupling of a single electron in a silicon double quantum dot to the photonic field of a microwave cavity, as shown by the observation of vacuum Rabi splitting. Strong coupling of a quantum dot electron to a cavity photon would allow for long-range qubit coupling and the long-range entanglement of electrons in semiconductor quantum dots.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, R.E.; Gustafson, J.L.; Montry, G.R.

1999-08-10

A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1999-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
A class of parallel algorithms for computation of the manipulator inertia matrix

NASA Technical Reports Server (NTRS)

Fijany, Amir; Bejczy, Antal K.

1989-01-01

Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
FAST: framework for heterogeneous medical image computing and visualization.

PubMed

Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank

2015-11-01

Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the insight toolkit (ITK) and the visualization toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.
Picoradio: Communication/Computation Piconodes for Sensor Networks

DTIC Science & Technology

2003-01-02

diagram of PicoNode III, or Quark node. It is made from two custom chips, Strange RF and Charm digital processor , and is complemented by a set of...the chipset comprising of Strange (analog OOK transceiver) and Charm (digital processor ) chips. 44 Figure 33: System block diagram of the Quark node...19 2.B PICONODE II - TWO-CHIP PICONODE IMPLEMENTATION ......................................... 21 2.B.1 Baseband processor (BBP
Northeast Parallel Architectures Center (NPAC)

DTIC Science & Technology

1992-07-01

Computational Techniques: Mapping receptor units to processors , using NEWS communication to model interaction in the inhibitory field Goal of the Research...algorithms for classical problems to take advantage of multiple processors . Experiments in probability that have been too time consuming on serial...machine and achieved speedups of 4 to 5 times with 11 processors . It is believed that a slightly better speedup is achievable. In the case of stuck
Superconducting quantum circuits at the surface code threshold for fault tolerance.

PubMed

Barends, R; Kelly, J; Megrant, A; Veitia, A; Sank, D; Jeffrey, E; White, T C; Mutus, J; Fowler, A G; Campbell, B; Chen, Y; Chen, Z; Chiaro, B; Dunsworth, A; Neill, C; O'Malley, P; Roushan, P; Vainsencher, A; Wenner, J; Korotkov, A N; Cleland, A N; Martinis, John M

2014-04-24

A quantum computer can solve hard problems, such as prime factoring, database searching and quantum simulation, at the cost of needing to protect fragile quantum states from error. Quantum error correction provides this protection by distributing a logical state among many physical quantum bits (qubits) by means of quantum entanglement. Superconductivity is a useful phenomenon in this regard, because it allows the construction of large quantum circuits and is compatible with microfabrication. For superconducting qubits, the surface code approach to quantum computing is a natural choice for error correction, because it uses only nearest-neighbour coupling and rapidly cycled entangling gates. The gate fidelity requirements are modest: the per-step fidelity threshold is only about 99 per cent. Here we demonstrate a universal set of logic gates in a superconducting multi-qubit processor, achieving an average single-qubit gate fidelity of 99.92 per cent and a two-qubit gate fidelity of up to 99.4 per cent. This places Josephson quantum computing at the fault-tolerance threshold for surface code error correction. Our quantum processor is a first step towards the surface code, using five qubits arranged in a linear array with nearest-neighbour coupling. As a further demonstration, we construct a five-qubit Greenberger-Horne-Zeilinger state using the complete circuit and full set of gates. The results demonstrate that Josephson quantum computing is a high-fidelity technology, with a clear path to scaling up to large-scale, fault-tolerant quantum circuits.
Implementation and Assessment of Advanced Analog Vector-Matrix Processor

NASA Technical Reports Server (NTRS)

Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

1994-01-01

This paper discusses the design and implementation of an analog optical vecto-rmatrix coprocessor with a throughput of 128 Mops for a personal computer. Vector matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector matrix processor consists of a laser diode source, an acoustooptical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low cost, highly energy efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation. The sufficiency of the measured accuracy of the processor is compared with that required for a range of typical problems. Calculations resolving alloy concentrations from spectral plume data of rocket engines are implemented on the optical processor, demonstrating its sufficiency for this problem. We also show how this technology can be easily extended to a 100 x 100 10 MHz (200 Cops) processor.
Networked Workstations and Parallel Processing Utilizing Functional Languages

DTIC Science & Technology

1993-03-01

program . This frees the programmer to concentrate on what the program is to do, not how the program is...traditional ’von Neumann’ architecture uses a timer based (e.g., the program counter), sequentially pro- grammed, single processor approach to problem...traditional ’von Neumann’ architecture uses a timer based (e.g., the program counter), sequentially programmed , single processor approach to
Parallel Semi-Implicit Spectral Element Atmospheric Model

NASA Astrophysics Data System (ADS)

Fournier, A.; Thomas, S.; Loft, R.

2001-05-01

The shallow-water equations (SWE) have long been used to test atmospheric-modeling numerical methods. The SWE contain essential wave-propagation and nonlinear effects of more complete models. We present a semi-implicit (SI) improvement of the Spectral Element Atmospheric Model to solve the SWE (SEAM, Taylor et al. 1997, Fournier et al. 2000, Thomas & Loft 2000). SE methods are h-p finite element methods combining the geometric flexibility of size-h finite elements with the accuracy of degree-p spectral methods. Our work suggests that exceptional parallel-computation performance is achievable by a General-Circulation-Model (GCM) dynamical core, even at modest climate-simulation resolutions (>1o). The code derivation involves weak variational formulation of the SWE, Gauss(-Lobatto) quadrature over the collocation points, and Legendre cardinal interpolators. Appropriate weak variation yields a symmetric positive-definite Helmholtz operator. To meet the Ladyzhenskaya-Babuska-Brezzi inf-sup condition and avoid spurious modes, we use a staggered grid. The SI scheme combines leapfrog and Crank-Nicholson schemes for the nonlinear and linear terms respectively. The localization of operations to elements ideally fits the method to cache-based microprocessor computer architectures --derivatives are computed as collections of small (8x8), naturally cache-blocked matrix-vector products. SEAM also has desirable boundary-exchange communication, like finite-difference models. Timings on on the IBM SP and Compaq ES40 supercomputers indicate that the SI code (20-min timestep) requires 1/3 the CPU time of the explicit code (2-min timestep) for T42 resolutions. Both codes scale nearly linearly out to 400 processors. We achieved single-processor performance up to 30% of peak for both codes on the 375-MHz IBM Power-3 processors. Fast computation and linear scaling lead to a useful climate-simulation dycore only if enough model time is computed per unit wall-clock time. An efficient SI solver is essential to substantially increase this rate. Parallel preconditioning for an iterative conjugate-gradient elliptic solver is described. We are building a GCM dycore capable of 200 GF% lOPS sustained performance on clustered RISC/cache architectures using hybrid MPI/OpenMP programming.
Digital algorithms for parallel pipelined single-detector homodyne fringe counting in laser interferometry

NASA Astrophysics Data System (ADS)

Rerucha, Simon; Sarbort, Martin; Hola, Miroslava; Cizek, Martin; Hucl, Vaclav; Cip, Ondrej; Lazar, Josef

2016-12-01

The homodyne detection with only a single detector represents a promising approach in the interferometric application which enables a significant reduction of the optical system complexity while preserving the fundamental resolution and dynamic range of the single frequency laser interferometers. We present the design, implementation and analysis of algorithmic methods for computational processing of the single-detector interference signal based on parallel pipelined processing suitable for real time implementation on a programmable hardware platform (e.g. the FPGA - Field Programmable Gate Arrays or the SoC - System on Chip). The algorithmic methods incorporate (a) the single detector signal (sine) scaling, filtering, demodulations and mixing necessary for the second (cosine) quadrature signal reconstruction followed by a conic section projection in Cartesian plane as well as (a) the phase unwrapping together with the goniometric and linear transformations needed for the scale linearization and periodic error correction. The digital computing scheme was designed for bandwidths up to tens of megahertz which would allow to measure the displacements at the velocities around half metre per second. The algorithmic methods were tested in real-time operation with a PC-based reference implementation that employed the advantage pipelined processing by balancing the computational load among multiple processor cores. The results indicate that the algorithmic methods are suitable for a wide range of applications [3] and that they are bringing the fringe counting interferometry closer to the industrial applications due to their optical setup simplicity and robustness, computational stability, scalability and also a cost-effectiveness.
Optical Flow in a Smart Sensor Based on Hybrid Analog-Digital Architecture

PubMed Central

Guzmán, Pablo; Díaz, Javier; Agís, Rodrigo; Ros, Eduardo

2010-01-01

The purpose of this study is to develop a motion sensor (delivering optical flow estimations) using a platform that includes the sensor itself, focal plane processing resources, and co-processing resources on a general purpose embedded processor. All this is implemented on a single device as a SoC (System-on-a-Chip). Optical flow is the 2-D projection into the camera plane of the 3-D motion information presented at the world scenario. This motion representation is widespread well-known and applied in the science community to solve a wide variety of problems. Most applications based on motion estimation require work in real-time; hence, this restriction must be taken into account. In this paper, we show an efficient approach to estimate the motion velocity vectors with an architecture based on a focal plane processor combined on-chip with a 32 bits NIOS II processor. Our approach relies on the simplification of the original optical flow model and its efficient implementation in a platform that combines an analog (focal-plane) and digital (NIOS II) processor. The system is fully functional and is organized in different stages where the early processing (focal plane) stage is mainly focus to pre-process the input image stream to reduce the computational cost in the post-processing (NIOS II) stage. We present the employed co-design techniques and analyze this novel architecture. We evaluate the system’s performance and accuracy with respect to the different proposed approaches described in the literature. We also discuss the advantages of the proposed approach as well as the degree of efficiency which can be obtained from the focal plane processing capabilities of the system. The final outcome is a low cost smart sensor for optical flow computation with real-time performance and reduced power consumption that can be used for very diverse application domains. PMID:22319283
Parallelization of combinatorial search when solving knapsack optimization problem on computing systems based on multicore processors

NASA Astrophysics Data System (ADS)

Rahman, P. A.

2018-05-01

This scientific paper deals with the model of the knapsack optimization problem and method of its solving based on directed combinatorial search in the boolean space. The offered by the author specialized mathematical model of decomposition of the search-zone to the separate search-spheres and the algorithm of distribution of the search-spheres to the different cores of the multi-core processor are also discussed. The paper also provides an example of decomposition of the search-zone to the several search-spheres and distribution of the search-spheres to the different cores of the quad-core processor. Finally, an offered by the author formula for estimation of the theoretical maximum of the computational acceleration, which can be achieved due to the parallelization of the search-zone to the search-spheres on the unlimited number of the processor cores, is also given.
System and method for programmable bank selection for banked memory subsystems

DOEpatents

Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan

2010-09-07

A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
A high-speed digital signal processor for atmospheric radar, part 7.3A

NASA Technical Reports Server (NTRS)

Brosnahan, J. W.; Woodard, D. M.

1984-01-01

The Model SP-320 device is a monolithic realization of a complex general purpose signal processor, incorporating such features as a 32-bit ALU, a 16-bit x 16-bit combinatorial multiplier, and a 16-bit barrel shifter. The SP-320 is designed to operate as a slave processor to a host general purpose computer in applications such as coherent integration of a radar return signal in multiple ranges, or dedicated FFT processing. Presently available is an I/O module conforming to the Intel Multichannel interface standard; other I/O modules will be designed to meet specific user requirements. The main processor board includes input and output FIFO (First In First Out) memories, both with depths of 4096 W, to permit asynchronous operation between the source of data and the host computer. This design permits burst data rates in excess of 5 MW/s.
Quantum state transfer and controlled-phase gate on one-dimensional superconducting resonators assisted by a quantum bus.

PubMed

Hua, Ming; Tao, Ming-Jie; Deng, Fu-Guo

2016-02-24

We propose a quantum processor for the scalable quantum computation on microwave photons in distant one-dimensional superconducting resonators. It is composed of a common resonator R acting as a quantum bus and some distant resonators rj coupled to the bus in different positions assisted by superconducting quantum interferometer devices (SQUID), different from previous processors. R is coupled to one transmon qutrit, and the coupling strengths between rj and R can be fully tuned by the external flux through the SQUID. To show the processor can be used to achieve universal quantum computation effectively, we present a scheme to complete the high-fidelity quantum state transfer between two distant microwave-photon resonators and another one for the high-fidelity controlled-phase gate on them. By using the technique for catching and releasing the microwave photons from resonators, our processor may play an important role in quantum communication as well.
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

NASA Astrophysics Data System (ADS)

Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

2018-02-01

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named "Knights Landing" (KNL). The results show 2 times better performance on KNL processor.

Novel hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization estimation method for population pharmacokinetic data analysis.

PubMed

Ng, C M

2013-10-01

The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. A graphical-processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and identical algorithm that is designed for the single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and a NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds a great promise in serving as the core for the next-generation of modeling software for population PK/PD analysis.
Realization of a single image haze removal system based on DaVinci DM6467T processor

NASA Astrophysics Data System (ADS)

Liu, Zhuang

2014-10-01

Video monitoring system (VMS) has been extensively applied in domains of target recognition, traffic management, remote sensing, auto navigation and national defence. However the VMS has a strong dependence on the weather, for instance, in foggy weather, the quality of images received by the VMS are distinct degraded and the effective range of VMS is also decreased. All in all, the VMS performs terribly in bad weather. Thus the research of fog degraded images enhancement has very high theoretical and practical application value. A design scheme of a fog degraded images enhancement system based on the TI DaVinci processor is presented in this paper. The main function of the referred system is to extract and digital cameras capture images and execute image enhancement processing to obtain a clear image. The processor used in this system is the dual core TI DaVinci DM6467T - ARM@500MHz+DSP@1GH. A MontaVista Linux operating system is running on the ARM subsystem which handles I/O and application processing. The DSP handles signal processing and the results are available to the ARM subsystem in shared memory.The system benefits from the DaVinci processor so that, with lower power cost and smaller volume, it provides the equivalent image processing capability of a X86 computer. The outcome shows that the system in this paper can process images at 25 frames per second on D1 resolution.
Federal Plan for High-End Computing. Report of the High-End Computing Revitalization Task Force (HECRTF)

DTIC Science & Technology

2004-07-01

steadily for the past fifteen years, while memory latency and bandwidth have improved much more slowly. For example, Intel processor clock rates38 have... processor and memory performance) all greatly restrict the ability to achieve high levels of performance for science, engineering, and national...sub-nuclear distances. Guide experiments to identify transition from quantum chromodynamics to quark -gluon plasma. Accelerator Physics Accurate
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G.; Salapura, Valentina

2012-07-24

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Common Board Design for the OBC I/O Unit and The OBC CCSDS Unit of The Stuttgart University Satellite "Flying Laptop"

NASA Astrophysics Data System (ADS)

Eickhoff, Jens; Cook, Barry; Walker, Paul; Habinc, Sadi; Witt, Rouven; Roser, Hans-Peter

2011-08-01

As already published in another paper at DASIA 2010 in Budapest [1] the University of Stuttgart, Germany, is developing an advanced 3-axis stabilized small satellite applying industry standards for command/control techniques, onboard software design and onboard computer components.The satellite has a launch mass of approx. 120kg and is foreseen to be launched end 2013 as piggy back payload on an Indian PSLV launcher.During phase C the main challenge was the conceptual design for an ultra compact and performant onboard computer (OBC), which is able to support an industry standard operating system, a PUS standard based onboard software (OBSW) and CCSDS standard based ground/space communication. The developed architecture is based on 4 main elements (see [1] and Figure 4):• the OBC core board (single board computer based on LEON3 FT architecture),• an I/O Board for all OBC digital interfaces to S/C equipment,• a CCSDS TC/TM pre-processor board,• CPDU being embedded in the PCDU.The EM for the OBC core meanwhile has been shipped to the University by the supplier Aeroflex Colorado Springs, USA and is in use in Stuttgart since January 2011. Figure 2 and Figure 3 provide brief impressions. This paper concentrates on the common design of the I/O board and the CCSDS processor boards.
Load Balancing Unstructured Adaptive Grids for CFD Problems

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid

1996-01-01

Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor

NASA Technical Reports Server (NTRS)

Gilliland, M. C.; Smith, B. J.; Calvert, W.

1976-01-01

The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
The One Computer Classroom: Applications in Language Arts.

ERIC Educational Resources Information Center

Tobin, Kenneth; Tobin, Barbara

1985-01-01

Describes a study showing that primary grade students can benefit from using a word processor in activities that involve the creation of text. Suggests activities that further extend the value of the word processor in language arts instruction. (EL)
Baseband processor development for the Advanced Communications Satellite Program

NASA Technical Reports Server (NTRS)

Moat, D.; Sabourin, D.; Stilwell, J.; Mccallister, R.; Borota, M.

1982-01-01

An onboard-baseband-processor concept for a satellite-switched time-division-multiple-access (SS-TDMA) communication system was developed for NASA Lewis Research Center. The baseband processor routes and controls traffic on an individual message basis while providing significant advantages in improved link margins and system flexibility. Key technology developments required to prove the flight readiness of the baseband-processor design are being verified in a baseband-processor proof-of-concept model. These technology developments include serial MSK modems, Clos-type baseband routing switch, a single-chip CMOS maximum-likelihood convolutional decoder, and custom LSL implementation of high-speed, low-power ECL building blocks.
Measurement of fault latency in a digital avionic mini processor, part 2

NASA Technical Reports Server (NTRS)

Mcgough, J.; Swern, F.

1983-01-01

The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are described. Several earlier programs were reprogrammed, expanding the instruction set to capitalize on the full power of the BDX-930 computer. As a final demonstration of fault coverage an extensive, 3-axis, high performance flght control computation was added. The stages in the development of a CPU self-test program emphasizing the relationship between fault coverage, speed, and quantity of instructions were demonstrated.
The development of an engineering computer graphics laboratory

NASA Technical Reports Server (NTRS)

Anderson, D. C.; Garrett, R. E.

1975-01-01

Hardware and software systems developed to further research and education in interactive computer graphics were described, as well as several of the ongoing application-oriented projects, educational graphics programs, and graduate research projects. The software system consists of a FORTRAN 4 subroutine package, in conjunction with a PDP 11/40 minicomputer as the primary computation processor and the Imlac PDS-1 as an intelligent display processor. The package comprises a comprehensive set of graphics routines for dynamic, structured two-dimensional display manipulation, and numerous routines to handle a variety of input devices at the Imlac.
Benchmarking gate-based quantum computers

NASA Astrophysics Data System (ADS)

Michielsen, Kristel; Nocon, Madita; Willsch, Dennis; Jin, Fengping; Lippert, Thomas; De Raedt, Hans

2017-11-01

With the advent of public access to small gate-based quantum processors, it becomes necessary to develop a benchmarking methodology such that independent researchers can validate the operation of these processors. We explore the usefulness of a number of simple quantum circuits as benchmarks for gate-based quantum computing devices and show that circuits performing identity operations are very simple, scalable and sensitive to gate errors and are therefore very well suited for this task. We illustrate the procedure by presenting benchmark results for the IBM Quantum Experience, a cloud-based platform for gate-based quantum computing.
Efficient Helicopter Aerodynamic and Aeroacoustic Predictions on Parallel Computers

NASA Technical Reports Server (NTRS)

Wissink, Andrew M.; Lyrintzis, Anastasios S.; Strawn, Roger C.; Oliker, Leonid; Biswas, Rupak

1996-01-01

This paper presents parallel implementations of two codes used in a combined CFD/Kirchhoff methodology to predict the aerodynamics and aeroacoustics properties of helicopters. The rotorcraft Navier-Stokes code, TURNS, computes the aerodynamic flowfield near the helicopter blades and the Kirchhoff acoustics code computes the noise in the far field, using the TURNS solution as input. The overall parallel strategy adds MPI message passing calls to the existing serial codes to allow for communication between processors. As a result, the total code modifications required for parallel execution are relatively small. The biggest bottleneck in running the TURNS code in parallel comes from the LU-SGS algorithm that solves the implicit system of equations. We use a new hybrid domain decomposition implementation of LU-SGS to obtain good parallel performance on the SP-2. TURNS demonstrates excellent parallel speedups for quasi-steady and unsteady three-dimensional calculations of a helicopter blade in forward flight. The execution rate attained by the code on 114 processors is six times faster than the same cases run on one processor of the Cray C-90. The parallel Kirchhoff code also shows excellent parallel speedups and fast execution rates. As a performance demonstration, unsteady acoustic pressures are computed at 1886 far-field observer locations for a sample acoustics problem. The calculation requires over two hundred hours of CPU time on one C-90 processor but takes only a few hours on 80 processors of the SP2. The resultant far-field acoustic field is analyzed with state of-the-art audio and video rendering of the propagating acoustic signals.
Design and implementation of a medium speed communications interface and protocol for a low cost, refreshed display computer

NASA Technical Reports Server (NTRS)

Phyne, J. R.; Nelson, M. D.

1975-01-01

The design and implementation of hardware and software systems involved in using a 40,000 bit/second communication line as the connecting link between an IMLAC PDS 1-D display computer and a Univac 1108 computer system were described. The IMLAC consists of two independent processors sharing a common memory. The display processor generates the deflection and beam control currents as it interprets a program contained in the memory; the minicomputer has a general instruction set and is responsible for starting and stopping the display processor and for communicating with the outside world through the keyboard, teletype, light pen, and communication line. The processing time associated with each data byte was minimized by designing the input and output processes as finite state machines which automatically sequence from each state to the next. Several tests of the communication link and the IMLAC software were made using a special low capacity computer grade cable between the IMLAC and the Univac.
Underwater Threat Source Localization: Processing Sensor Network TDOAs with a Terascale Optical Core Device

DOE Office of Scientific and Technical Information (OSTI.GOV)

Barhen, Jacob; Imam, Neena

2007-01-01

Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimizedmore » for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel XeonTM processor.« less
Computational Issues in Damping Identification for Large Scale Problems

NASA Technical Reports Server (NTRS)

Pilkey, Deborah L.; Roe, Kevin P.; Inman, Daniel J.

1997-01-01

Two damping identification methods are tested for efficiency in large-scale applications. One is an iterative routine, and the other a least squares method. Numerical simulations have been performed on multiple degree-of-freedom models to test the effectiveness of the algorithm and the usefulness of parallel computation for the problems. High Performance Fortran is used to parallelize the algorithm. Tests were performed using the IBM-SP2 at NASA Ames Research Center. The least squares method tested incurs high communication costs, which reduces the benefit of high performance computing. This method's memory requirement grows at a very rapid rate meaning that larger problems can quickly exceed available computer memory. The iterative method's memory requirement grows at a much slower pace and is able to handle problems with 500+ degrees of freedom on a single processor. This method benefits from parallelization, and significant speedup can he seen for problems of 100+ degrees-of-freedom.
An observatory control system for the University of Hawai'i 2.2m Telescope

NASA Astrophysics Data System (ADS)

McKay, Luke; Erickson, Christopher; Mukensnable, Donn; Stearman, Anthony; Straight, Brad

2016-07-01

The University of Hawai'i 2.2m telescope at Maunakea has operated since 1970, and has had several controls upgrades to date. The newest system will operate as a distributed hierarchy of GNU/Linux central server, networked single-board computers, microcontrollers, and a modular motion control processor for the main axes. Rather than just a telescope control system, this new effort is towards a cohesive, modular, and robust whole observatory control system, with design goals of fully robotic unattended operation, high reliability, and ease of maintenance and upgrade.
Execution of parallel algorithms on a heterogeneous multicomputer

NASA Astrophysics Data System (ADS)

Isenstein, Barry S.; Greene, Jonathon

1995-04-01

Many aerospace/defense sensing and dual-use applications require high-performance computing, extensive high-bandwidth interconnect and realtime deterministic operation. This paper will describe the architecture of a scalable multicomputer that includes DSP and RISC processors. A single chassis implementation is capable of delivering in excess of 10 GFLOPS of DSP processing power with 2 Gbytes/s of realtime sensor I/O. A software approach to implementing parallel algorithms called the Parallel Application System (PAS) is also presented. An example of applying PAS to a DSP application is shown.
Large-Constraint-Length, Fast Viterbi Decoder

NASA Technical Reports Server (NTRS)

Collins, O.; Dolinar, S.; Hsu, In-Shek; Pollara, F.; Olson, E.; Statman, J.; Zimmerman, G.

1990-01-01

Scheme for efficient interconnection makes VLSI design feasible. Concept for fast Viterbi decoder provides for processing of convolutional codes of constraint length K up to 15 and rates of 1/2 to 1/6. Fully parallel (but bit-serial) architecture developed for decoder of K = 7 implemented in single dedicated VLSI circuit chip. Contains six major functional blocks. VLSI circuits perform branch metric computations, add-compare-select operations, and then store decisions in traceback memory. Traceback processor reads appropriate memory locations and puts out decoded bits. Used as building block for decoders of larger K.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques Held 24-26 August 1994 in Montreal, Canada

DTIC Science & Technology

1994-08-26

an Itegrated Circuit Global Router. In Proc. of PPEARS 88, pages 138-145, 1988. [7] S. Sakai, Y. Yamaguchi, K. Hiraki , Y. Kodama, and T. Yuba. An...Computer Architecture, 1992. [5] S. Sakai, Y. Yamaguchi, K. Hiraki , Y. Kodama, and T. Yuba. An architecture of a data-flow single chip processor. In Int...EM-4 and sparing time for tech- nical discussions. We also thank Prof. Kei Hiraki at the Univ. of Tokyo for his helpful comments. Hidehiko Masuhara’s

Some links on this page may take you to non-federal websites. Their policies may differ from this site.