Science.gov

Sample records for parallel processor array

  1. Integration of IR focal plane arrays with massively parallel processor

    NASA Astrophysics Data System (ADS)

    Esfandiari, P.; Koskey, P.; Vaccaro, K.; Buchwald, W.; Clark, F.; Krejca, B.; Rekeczky, C.; Zarandy, A.

    2008-04-01

    The intent of this investigation is to replace the low fill factor visible sensor of a Cellular Neural Network (CNN) processor with an InGaAs Focal Plane Array (FPA) using both bump bonding and epitaxial layer transfer techniques for use in the Ballistic Missile Defense System (BMDS) interceptor seekers. The goal is to fabricate a massively parallel digital processor with a local as well as a global interconnect architecture. Currently, this unique CNN processor is capable of processing a target scene in excess of 10,000 frames per second with its visible sensor. What makes the CNN processor so unique is that each processing element includes memory, local data storage, local and global communication devices and a visible sensor supported by a programmable analog or digital computer program.

  2. Digital Parallel Processor Array for Optimum Path Planning

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

    1996-01-01

    The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of a direction along which the first stimulus signal is received, broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.
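
    The propagate-then-trace-back scheme described above can be sketched in software as follows. This is a minimal illustration, not the patented circuit: the priority queue stands in for the physical passage of time, and the names `propagate`, `trace_back`, and the sample `delays` grid are hypothetical. Each cell's delay is assumed to represent its traversal cost.

```python
import heapq

# Offsets for the four neighbor directions (an assumption; the patent
# only says links extend "along different directions").
DIRS = [(-1, 0), (1, 0), (0, 1), (0, -1)]

def propagate(delays, start):
    """Spread a stimulus from `start`; each cell rebroadcasts after its delay.

    Returns a dict mapping each reached cell to the offset that points back
    toward the neighbor whose stimulus arrived first.
    """
    rows, cols = len(delays), len(delays[0])
    arrival = {start: 0}
    back_dir = {}
    heap = [(0, start)]
    while heap:
        t, cell = heapq.heappop(heap)
        if t > arrival[cell]:
            continue  # a later (stale) stimulus; the cell already fired
        fire = t + delays[cell[0]][cell[1]]
        for dr, dc in DIRS:
            nxt = (cell[0] + dr, cell[1] + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and fire < arrival.get(nxt, float("inf")):
                arrival[nxt] = fire
                back_dir[nxt] = (-dr, -dc)  # direction of first arrival
                heapq.heappush(heap, (fire, nxt))
    return back_dir

def trace_back(back_dir, start, dest):
    """Follow the stored directions from `dest` back to `start`."""
    path = [dest]
    while path[-1] != start:
        r, c = path[-1]
        dr, dc = back_dir[(r, c)]
        path.append((r + dr, c + dc))
    return path[::-1]

# Hypothetical terrain: the middle column is expensive to cross.
delays = [[1, 9, 1],
          [1, 9, 1],
          [1, 1, 1]]
path = trace_back(propagate(delays, (0, 0)), (0, 0), (0, 2))
```

    With this grid the traced path runs around the high-delay column rather than through it.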

  3. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  4. Fast parallel implementation of multidimensional data-domain FORTRAN codes on distributed-memory processor arrays

    NASA Astrophysics Data System (ADS)

    Reale, F.; Barbera, M.; Sciortino, S.

    1992-11-01

    We illustrate a general and straightforward approach to develop FORTRAN parallel two-dimensional data-domain applications on distributed-memory systems, such as those based on transputers. We have aimed at achieving flexibility for different processor topologies and processor numbers, non-homogeneous processor configurations and coarse load-balancing. We have assumed a master-slave architecture as basic programming model in the framework of a domain decomposition approach. After developing a library of high-level general network and communication routines, based on low-level system-dependent libraries, we have used it to parallelize some specific applications: an elementary 2-D code, useful as a pattern and guide for other more complex applications, and a 2-D hydrodynamic code for astrophysical studies. Code parallelization is achieved by splitting the original code into two independent codes, one for the master and the other for the slaves, and then by adding coordinated calls to network setting and message-passing routines into the programs. The parallel applications have been implemented on a Meiko Computing Surface hosted by a SUN 4 workstation and running CSTools software package. After the basic network and communication routines were developed, the task of parallelizing the 2-D hydrodynamic code took approximately 12 man hours. The parallel efficiency of the code ranges between 98% and 58% on arrays between 2 and 20 T800 transputers, on a relatively small computational mesh (≈3000 cells). Arrays consisting of a limited number of faster Intel i860 processors achieve a high parallel efficiency on large computational grids (> 10000 grid points) with performances in the class of minisupercomputers.
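
    The efficiency figures above can be converted to speedups with the usual definition, efficiency = speedup / number of processors. The short sketch below assumes that the 98% figure belongs to the 2-transputer array and the 58% figure to the 20-transputer array, which the abstract implies but does not state outright; the function name is hypothetical.

```python
def speedup_from_efficiency(efficiency, n_processors):
    """Speedup implied by a parallel efficiency (efficiency = speedup / p)."""
    return efficiency * n_processors

# Assumed pairing of the reported figures with array sizes.
small_array = speedup_from_efficiency(0.98, 2)
large_array = speedup_from_efficiency(0.58, 20)
print(round(small_array, 2), round(large_array, 2))
```

    Under that assumption the code would run roughly 1.96 times faster on 2 transputers and roughly 11.6 times faster on 20.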

  5. Array processor architecture

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

  6. Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

    NASA Astrophysics Data System (ADS)

    Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

    2008-04-01

    This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
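
    The work-farm pattern described above (one input stream, a parallel set of workers, one output stream) can be imitated on a conventional machine with threads and queues. This is only an analogy under stated assumptions: the MPPA uses hardware channels and its own object programming model, whereas `work_farm`, its parameters, and the sequence-number tagging below are hypothetical choices for illustration.

```python
import queue
import threading

def work_farm(items, worker_fn, n_workers=4):
    """Fan items out to workers and merge results into one ordered stream."""
    inbox = queue.Queue()
    outbox = queue.Queue()

    def worker():
        while True:
            job = inbox.get()
            if job is None:        # sentinel: no more work
                break
            index, value = job
            outbox.put((index, worker_fn(value)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()

    count = 0
    for index, value in enumerate(items):
        inbox.put((index, value))  # tag each item so order can be restored
        count += 1
    for _ in threads:
        inbox.put(None)
    for t in threads:
        t.join()

    results = [outbox.get() for _ in range(count)]
    results.sort()                 # indices are unique, so this sorts by index
    return [result for _, result in results]

squares = work_farm(range(10), lambda x: x * x, n_workers=3)
```

    Passing different functions to different workers would give the heterogeneous variant the abstract mentions; the homogeneous case is shown here for brevity.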

  7. Optical logic array processor

    SciTech Connect

    Tanida, J.; Ichioka, Y.

    1983-01-01

    A simple method for optically implementing digital logic gates in parallel has been developed. Parallel logic gates can be achieved by using a lensless shadow-casting system with a light emitting diode array as an incoherent light source. All the sixteen logic functions for two binary variables, which are the fundamental computations of Boolean algebra, can be simply realised in parallel with these gates by changing the switching modes of an LED array. Parallel computation structures of the developed optical digital array processor are demonstrated by implementing pattern logics for two binary images with high space-bandwidth product. Applications of the proposed method to parallel shift operation of the image, differentiation, and processing of gray-level image are shown. 9 references.
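
    The count of sixteen follows from the four possible input pairs (0,0), (0,1), (1,0), (1,1): each function is a choice of output bit for every pair, giving 2^4 = 16. The sketch below encodes a function as a 4-bit code and applies it to every pixel pair of two binary images. The encoding and the name `apply_logic` are illustrative assumptions; the code only loosely mirrors the role of the LED on/off pattern and does not model the optics.

```python
def apply_logic(code, image_a, image_b):
    """Apply one of the 16 two-input logic functions pixel by pixel.

    Bit (2*a + b) of `code` is the output for input pixels a and b.
    """
    return [
        [(code >> (2 * a + b)) & 1 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]

AND, OR, XOR = 0b1000, 0b1110, 0b0110

# Two small binary images covering all four input combinations.
image_a = [[0, 0], [1, 1]]
image_b = [[0, 1], [0, 1]]

and_image = apply_logic(AND, image_a, image_b)
or_image = apply_logic(OR, image_a, image_b)
xor_image = apply_logic(XOR, image_a, image_b)
```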

  8. Computational cost of image registration with a parallel binary array processor

    SciTech Connect

    Reeves, A.P.; Rostampour, A.

    1982-07-01

    The application of a simulated binary array processor (BAP) to the rapid analysis of a sequence of images has been studied. Several algorithms have been developed which may be implemented on many existing parallel processing machines. The characteristic operations of a BAP are discussed and analyzed. A set of preprocessing algorithms are described which are designed to register two images of TV-type video data in real time. These algorithms illustrate the potential uses of a BAP and their cost is analyzed in detail. The results of applying these algorithms to FLIR data and to noisy optical data are given. An analysis of these algorithms illustrates the importance of an efficient global feature extraction hardware for image understanding applications. 16 references.

  9. Image processing system architecture using parallel arrays of digital signal processors

    NASA Astrophysics Data System (ADS)

    Kshirsagar, Shirish P.; Hobson, Clifford A.; Hartley, David A.; Harvey, David M.

    1993-10-01

    The paper describes the requirements of a high definition, high speed image processing system. Different types of parallel architectures were considered for the system. Advantages and limitations of SIMD and MIMD architectures are briefly discussed for image processing applications. A parallel image processing system based on MIMD architecture has been developed using multiple digital signal processors which can communicate with each other through an interconnection network. Texas Instruments TMS320C40 digital signal processors have been selected because they have a powerful floating point CPU supported by fast parallel communication ports, a DMA coprocessor and two memory interfaces. A five processor system is described in the paper. The EISA bus is used as the host interface and VISION bus is used to transfer images between the processors. The system is being used for automated non-contact inspection in which electro-optic signals are processed to identify manufacturing problems.

  10. Array processor architecture connection network

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1982-01-01

    A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.
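
    For readers unfamiliar with Omega networks, the sketch below shows textbook destination-tag routing through log2(N) stages of 2x2 switches, plus a check for requests that would need the same switch output. It is a generic model, not the strobe and connection logic of the patent; `omega_route` and `find_conflicts` are hypothetical names.

```python
def omega_route(src, dst, n_bits):
    """Return the line occupied after each stage when routing src -> dst.

    Each stage is a perfect shuffle (rotate left by one bit) followed by a
    2x2 switch that sets the low bit to the next destination bit, MSB first.
    """
    mask = (1 << n_bits) - 1
    pos = src
    path = []
    for stage in range(n_bits):
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask  # perfect shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1            # destination tag
        pos = (pos & ~1) | bit                             # switch setting
        path.append(pos)
    return path

def find_conflicts(requests, n_bits):
    """List (stage, line, first_src, second_src) for shared switch outputs."""
    seen = {}
    conflicts = []
    for src, dst in requests:
        for stage, line in enumerate(omega_route(src, dst, n_bits)):
            owner = seen.setdefault((stage, line), src)
            if owner != src:
                conflicts.append((stage, line, owner, src))
    return conflicts

# 8 processors and 8 memory modules (3 stages).
identity = [(p, p) for p in range(8)]
no_clash = find_conflicts(identity, 3)
clash = find_conflicts([(0, 0), (4, 1)], 3)
```

    In this model every processor addressing its own-numbered module passes without conflict, while processors 0 and 4 requesting modules 0 and 1 collide at the first stage, which is the situation the patent's control logic is designed to detect.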

  11. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  12. Array processors in chemistry

    SciTech Connect

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors (''array processors'') is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  13. VLSI array processor

    NASA Astrophysics Data System (ADS)

    Greenwood, E.

    1982-07-01

    The Arithmetic Processor Unit (APU) data base design check was completed. Minor design rule violations and design improvements were accomplished. The APU mask set has been fabricated and checked. Initial checking of all mask layers revealed a design rule problem in one layer. That layer was corrected, refabricated and checked out. The mask set has been delivered to the chip fabrication area. The fabrication process has been initiated. All work on the Array Processor Demonstration System (APDS) has been suspended at CHI until the additionally requested funding was received. That funding has been authorized and CHI will begin work on the APDS in July. The following activities are planned in the following quarter: 1) Complete fabrication of the first lot of VLSI APU devices. 2) Complete integration and check-out of the APDS simulator. 3) Complete integration and check-out of the APU breadboard. 4) Verify the VLSI APU wafer tests with the APU breadboard. 5) Complete check-out of the APDS using the APU breadboard.

  14. Parallel processor engine model program

    NASA Technical Reports Server (NTRS)

    Mclaughlin, P.

    1984-01-01

    The Parallel Processor Engine Model Program is a generalized engineering tool intended to aid in the design of parallel processing real-time simulations of turbofan engines. It is written in the FORTRAN programming language and executes as a subset of the SOAPP simulation system. Input/output and execution control are provided by SOAPP; however, the analysis, emulation and simulation functions are completely self-contained. A framework in which a wide variety of parallel processing architectures could be evaluated and tools with which the parallel implementation of a real-time simulation technique could be assessed are provided.

  15. Optical systolic array processor using residue arithmetic

    NASA Technical Reports Server (NTRS)

    Jackson, J.; Casasent, D.

    1983-01-01

    The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
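
    A small numerical sketch may help show why residue notation reduces dynamic range: each channel only ever holds values below its modulus, yet the full result is recoverable through the Chinese remainder theorem. The moduli (7, 11, 13) and the function names are assumptions chosen for illustration, and the code says nothing about the optical hardware or the quantizer circuit.

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime; representable range is 0..1000

def to_residues(x, moduli=MODULI):
    """Encode a non-negative integer as one residue per modulus."""
    return tuple(x % m for m in moduli)

def from_residues(residues, moduli=MODULI):
    """Decode with the Chinese remainder theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse
    return total % big_m

def matvec_residue(matrix, vector, moduli=MODULI):
    """Matrix-vector product computed independently in each residue channel."""
    result = []
    for row in matrix:
        channels = []
        for m in moduli:
            acc = 0
            for a, v in zip(row, vector):
                acc = (acc + (a % m) * (v % m)) % m  # never exceeds m - 1
            channels.append(acc)
        result.append(from_residues(channels, moduli))
    return result

product = matvec_residue([[1, 2], [3, 4]], [5, 6])
```

    The decoded result is only correct while every true output stays below the product of the moduli (1001 here), so a real design would choose moduli to cover the expected range.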

  16. A highly parallel signal processor

    NASA Astrophysics Data System (ADS)

    Bigham, Jackson D., Jr.

    There is an increasing need for signal processors functional across a broad range of problems, from radar systems to E-O and ESM applications. To meet this challenge, a signal processing system capable of efficiently meeting the processing requirements over a broad range of avionics sensor systems has been developed. The CDC Parallel Modular Signal Processor (PMSP) is a complete MIL/E-5400-qualified digital signal processing system capable of computation rates greater than 600 MOPS (million operations per second). The signal processing element of the PMSP is the Micro-AFP. It is an all-VLSI processor capable of executing multiple simultaneous operations. Up to five Micro-AFPs and 12 MB of main store memory (MSM), along with associated control and I/O functions, are contained in the PMSP's standard ATR enclosure.

  17. Parallel Analog-to-Digital Image Processor

    NASA Technical Reports Server (NTRS)

    Lokerson, D. C.

    1987-01-01

    Proposed integrated-circuit network of many identical units convert analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output digital, lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

  18. Rectangular Array Of Digital Processors For Planning Paths

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

    1993-01-01

    Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. Transmits signals to nearest neighbor processors, which retransmits to other neighboring processors, and process repeats until signals propagated across entire field.

  19. Adapting implicit methods to parallel processors

    SciTech Connect

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D.

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g. larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems where it is common to take the computational domain and distribute the grid points over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this will result in idle processors during part of the computation, and as the number of idle processors increases, the effective speed improvement from using a parallel processor decreases.

  20. Ultrafast Fourier-transform parallel processor

    SciTech Connect

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  1. Grundy - Parallel processor architecture makes programming easy

    NASA Technical Reports Server (NTRS)

    Meier, R. J., Jr.

    1985-01-01

    The hardware, software, and firmware of the parallel processor, Grundy, are examined. The Grundy processor uses a simple processor that has a totally orthogonal three-address instruction set. The system contains a relative and indirect processing mode to support the high-level language, and uses pseudoprocessors and read-only memory. The system supports high-level language in which arbitrary degrees of algorithmic parallelism are expressed. The functions of the compiler and invocation frame are described. Grundy uses an operating system that can be accessed by an arbitrary number of processes simultaneously, and the access time grows only as the logarithm of the number of active processes. Applications for the parallel processor are discussed.

  2. Ray tracing on a networked processor array

    NASA Astrophysics Data System (ADS)

    Yang, Jungsook; Lee, Seung Eun; Chen, Chunyi; Bagherzadeh, Nader

    2010-10-01

    As computation costs increase to meet design requirements for computation-intensive graphics applications on today's embedded systems, the pressure to develop high-performance parallel processors on a chip will increase. Acceleration of the ray tracing computation has become a major issue as the computer graphics industry demands the rendering of realistic images. Network-on-chip (NoC) techniques that interconnect multiple processing elements with routers are the solution for reducing computation time and power consumption by parallel processing on a chip. It is also essential to meet the scalability and complexity challenges for system-on-chip (SoC). In this article, we describe a parallel ray tracing application mapping on a mesh-based multicore NoC architecture. We describe an optimised ray tracing kernel and parallelisation strategies, varying the workload distribution statically and dynamically. In this work, we present results and timing performance of our parallel ray tracing application on a NoC, which are obtained through our cycle accurate multicore NoC simulator. Using a dynamic scheduling load balancing technique, we achieved a maximum speedup multiplier of 35.97 on an 8 × 8 networked processor array using a NoC as the interconnect.

  3. A parallel pipelined dataflow trigger processor

    SciTech Connect

    Lee, C.; Miller, G.; Kaplan, D.M.; Sa, J.; Hsiung, Y.B.; Carey, T.; Jeppesen, R.

    1991-04-01

    This paper describes a parallel pipelined data flow trigger processor which is used in Fermilab E789. E789 is an experiment to study low-multiplicity decays of particles containing b or c quarks. The processor consists of an upstream vertex processor and a downstream track processor. The algorithms which reconstruct the postulated particle paths and calculate particle origin are implemented via interconnected function-specific hardware modules. The algorithm is directly dependent upon the organization of the modules, the specific arrangement of the inter-module cabling, and on-board memory data. The processor provides an indication of the presence of at least one interesting particle pair in the current event by asserting Read on its Read/Skip output. The Read assertion is then used as a trigger to capture all of the event's data for subsequent extensive off-line analysis.

  4. Parallel processor programs in the Federal Government

    NASA Technical Reports Server (NTRS)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  5. Fault-tolerant parallel processor

    SciTech Connect

    Harper, R.E.; Lala, J.H.

    1991-06-01

    This paper addresses issues central to the design and operation of an ultrareliable, Byzantine resilient parallel computer. Interprocessor connectivity requirements are met by treating connectivity as a resource that is shared among many processing elements, allowing flexibility in their configuration and reducing complexity. Redundant groups are synchronized solely by message transmissions and receptions, which also provide input data consistency and output voting. Reliability analysis results are presented that demonstrate the reduced failure probability of such a system. Performance analysis results are presented that quantify the temporal overhead involved in executing such fault-tolerance-specific operations. Empirical performance measurements of prototypes of the architecture are presented. 30 refs.

  6. Assignment Of Finite Elements To Parallel Processors

    NASA Technical Reports Server (NTRS)

    Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

    1990-01-01

    Elements assigned approximately optimally to subdomains. Mapping algorithm based on simulated-annealing concept used to minimize approximate time required to perform finite-element computation on hypercube computer or other network of parallel data processors. Mapping algorithm needed when shape of domain complicated or otherwise not obvious what allocation of elements to subdomains minimizes cost of computation.

  7. The monarch parallel processor hardware design

    SciTech Connect

    Rettberg, R.D.; Crowther, W.R.; Carvey, P.P.; Tomlinson, R.S.

    1990-04-01

    The authors report on their development of the Monarch parallel processor. Today, the Monarch's design is largely done and well into implementation. The high-speed interconnection network has been tested with two-micron switch chips, logging more than 30,000 device hours of operation at 125 megabits per second, passing over 10^16 bits. The processor's logic design is almost complete and simulated. The memory controller and concentrator remain to be designed. The authors have analyzed the software in detail with the use of hand-coded examples, a simulator, and a rudimentary compiler. The authors are currently seeking support to finish the implementation.

  8. SLAPP: A systolic linear algebra parallel processor

    SciTech Connect

    Drake, B.L.; Luk, F.T.; Speiser, J.M.; Symanski, J.J.

    1987-07-01

    Systolic array computer architectures provide a means for fast computation of the linear algebra algorithms that form the building blocks of many signal-processing algorithms, facilitating their real-time computation. For applications to signal processing, the systolic array operates on matrices, an inherently parallel view of the data, using numerical linear algebra algorithms that have been suitably parallelized to efficiently utilize the available hardware. This article describes work currently underway at the Naval Ocean Systems Center, San Diego, California, to build a two-dimensional systolic array, SLAPP, demonstrating efficient and modular parallelization of key matrix computations for real-time signal- and image-processing problems.

  9. Grundy: Parallel Processor Architecture Makes Programming Easy

    NASA Astrophysics Data System (ADS)

    Meier, Robert J.

    1985-12-01

    Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space and the concept of machine state is replaced by an invocation frame, a data structure of local variables, program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases. An instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelinable. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon[1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. As stack pointer, accumulators, and index registers facilitate using high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip (2M lambda[2]). A bootstrapper and Input/Output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100-chip computer that is expected to have over 500 processors, 500K memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.

  10. LU and Cholesky decomposition on an optical systolic array processor

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Ghosh, A.

    1983-01-01

    Direct solutions of matrix-vector equations on an optical systolic array processor are considered. The solutions are discussed and a parallel algorithm for LU matrix decomposition that is very attractive for an optical realization is formulated. It is noted that when direct techniques are used, it is preferable to realize the matrix decomposition on an optical system and to utilize a digital processor for the solution of the simplified resultant matrix-vector problem. One method of realizing LU matrix decomposition on a new frequency-multiplexed optical systolic array matrix-matrix processor is described. A simple method for extending the process of LU decomposition to Cholesky decomposition on the optical processor is discussed.
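
    As a reference point for the decompositions named above, here is a plain sequential sketch of Doolittle LU factorization (no pivoting) and of one common way to obtain a Cholesky factor from it for a symmetric positive-definite matrix. This is the textbook numerical procedure, not the optical frequency-multiplexed realization, and the extension shown is not claimed to be the authors' method; the function names are hypothetical.

```python
from math import sqrt

def lu_decompose(a):
    """Doolittle LU without pivoting: a = l @ u, with a unit-diagonal l."""
    n = len(a)
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # row i of u
            u[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of l
            l[j][i] = (a[j][i] - sum(l[j][k] * u[k][i] for k in range(i))) / u[i][i]
    return l, u

def cholesky_from_lu(a):
    """For symmetric positive-definite a, scale l's columns by sqrt(diag(u))."""
    l, u = lu_decompose(a)
    n = len(a)
    return [[l[i][j] * sqrt(u[j][j]) for j in range(n)] for i in range(n)]

lower, upper = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
g = cholesky_from_lu([[4.0, 2.0], [2.0, 3.0]])
```

    Because no pivoting is done, the sketch fails with a division by zero if a zero pivot appears; a production routine would reorder rows.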

  11. Associative massively parallel processor for video processing

    NASA Astrophysics Data System (ADS)

    Krikelis, Argy; Tawiah, T.

    1996-03-01

    Massively parallel processing architectures have matured primarily through image processing and computer vision applications. The similarity of processing requirements between these areas and video processing suggests that they should be very appropriate for video processing applications. This research describes the use of an associative massively parallel processing based system for video compression, including an architectural and system description, discussion of the implementation of compression tasks such as DCT/IDCT, Motion Estimation and Quantization, and system evaluation. The core of the processing system is the ASP (Associative String Processor) architecture, a modular, massively parallel, programmable and inherently fault-tolerant fine-grain SIMD processing architecture incorporating a string of identical APEs (Associative Processing Elements), a reconfigurable inter-processor communication network and a Vector Data Buffer for fully-overlapped data input-output. For video compression applications a prototype system is developed, which uses ASP modules to implement the required compression tasks. This scheme leads to a linear speed-up of the computation by simply adding more APEs to the modules.

  12. Scalable Unix tools on parallel processors

    SciTech Connect

    Gropp, W.; Lusk, E.

    1994-12-31

    The introduction of parallel processors that run a separate copy of Unix on each processor has introduced new problems in managing the user's environment. This paper discusses some generalizations of common Unix commands for managing files (e.g. ls) and processes (e.g. ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.

  13. Intermediate-level computer-vision-processing algorithm development for the content-addressable-array parallel processor. Quarterly status report No. 3 for period ending 29 November 1986

    SciTech Connect

    Not Available

    1986-12-15

    During this quarter a set of seven benchmark problems was developed and analyzed for the IUA. These included Hough Transform, Convex Hull, Voronoi Diagram, Minimal Spanning Tree, Visibility of Vertices in a projected 3-dimensional model, subgraph isomorphism, and the minimum-cost path between points in a weighted graph. These problems are commonly considered intermediate-level processing in many vision research groups. Parallel implementations of UMass intermediate-level processing algorithms, such as Boldt's line merging and Anandan's motion analysis, continued to develop. A commercial processor, the TMS320C25, was chosen as the Intermediate Communications and Associative Processor (ICAP) processing element. The TMS320C25 has the advantages that it is a five-million instruction per second signal-processing unit with a fast multiplier and software support for fast floating-point operations. It also has a built-in 5 Mb/s serial port that will interface well with the intermediate-level communications network. Also being explored is a set of group-theoretic network topologies with respect to the communication needs of intermediate-level processing. This has required the analysis of the classes of communication needed in each of the algorithms implemented.

  14. The Massively Parallel Processor and its applications. [for environmental monitoring]

    NASA Technical Reports Server (NTRS)

    Strong, J. P.; Schaefer, D. H.; Fischer, J. R.; Wallgren, K. R.; Bracken, P. A.

    1979-01-01

    A long-term experimental development program conducted at Goddard Space Flight Center to implement an ultrahigh-speed data processing system known as the Massively Parallel Processor (MPP) is described. The MPP is a single instruction multiple data stream computer designed to perform logical, integer, and floating point arithmetic operations on variable word length data. Information is presented on system architecture, the system configuration, the array unit architecture, individual processing units, and expected operating rates for several image processing applications (including the processing of Landsat data).

  15. APEmille: a parallel processor in the teraflop range

    NASA Astrophysics Data System (ADS)

    Panizzi, E.

    1997-02-01

    APEmille is a SIMD parallel processor under development at the Italian National Institute for Nuclear Physics (INFN). It is the third machine of the APE family, following Ape and Ape100 and delivering peak performance in the Tflops range. APEmille is very well suited for Lattice QCD applications, both for its hardware characteristics and for its software and language features. APEmille is an array of custom arithmetic processors arranged on a tridimensional torus. The replicated processor is a pipelined VLIW device performing integer and single/double precision IEEE floating point operations. The processor is optimized for complex computations and has a peak performance of 528 Mflops at 66 MHz. Each replica has 8 Mbytes of locally addressable RAM. In principle an array of 2048 nodes is able to break the Tflops barrier. Two other custom processors are used for program flow control, global addressing and inter-node communications. Fast nearest neighbour communications as well as longer distance communications and data broadcast are available. APEmille is interfaced to the external world by a PCI interface and a HIPPI channel. A network of PCs acts as the host computer. The APE operating system and the cross compiler run on it. A powerful programming language named TAO is provided and is highly optimized for QCD. A C++ compiler is foreseen. The TAO language is as simple as Fortran but as powerful as object oriented languages. Specific data structures, operators and even statements can be defined by the user for each different application. Effort has been made to define the language constructs for QCD.

  16. Solid modeling on a massively parallel processor

    SciTech Connect

    Strip, D.; Karasick, M.

    1992-01-01

    Solid modeling underlies many technologies that are key to modern manufacturing. These range from computer-aided design systems to robot simulators, from finite element analysis to integrated circuit process modeling. The accuracy, and hence the utility, of these models is often constrained by the amount of computer time required to perform the desired operations. This paper presents a family of algorithms for solid modeling operations using the Connection Machine, a massively parallel SIMD processor. The authors describe a data structure for representing solid models and algorithms that use the representation to implement efficiently a variety of solid modeling operations. The authors give a sketch of the algorithm for intersecting solids and present computational experience using these algorithms. The data structure and algorithms are contrasted with those of serial architectures, and execution times are compared.

  17. Intelligent spatial ecosystem modeling using parallel processors

    SciTech Connect

    Maxwell, T.; Costanza, R.

    1993-05-01

    Spatial modeling of ecosystems is essential if one's modeling goals include developing a relatively realistic description of past behavior and predictions of the impacts of alternative management policies on future ecosystem behavior. Development of these models has been limited in the past by the large amount of input data required and the difficulty of even large mainframe serial computers in dealing with large spatial arrays. These two limitations have begun to erode with the increasing availability of remote sensing data and GIS systems to manipulate it, and the development of parallel computer systems which allow computation of large, complex, spatial arrays. Although many forms of dynamic spatial modeling are highly amenable to parallel processing, the primary focus in this project is on process-based landscape models. These models simulate spatial structure by first compartmentalizing the landscape into some geometric design and then describing flows within compartments and spatial processes between compartments according to location-specific algorithms. The authors are currently building and running parallel spatial models at the regional scale for the Patuxent River region in Maryland, the Everglades in Florida, and Barataria Basin in Louisiana. The authors are also planning a project to construct a series of spatially explicit linked ecological and economic simulation models aimed at assessing the long-term potential impacts of global climate change.

  18. Efficient searching and sorting applications using an associative array processor

    NASA Technical Reports Server (NTRS)

    Pace, W.; Quinn, M. J.

    1978-01-01

    The purpose of this paper is to describe a method of searching and sorting data by using some of the unique capabilities of an associative array processor. To understand the application, the associative array processor is described in detail. In particular, the content addressable memory and flip network are discussed because these two unique elements give the associative array processor the power to rapidly sort and search. A simple alphanumeric sorting example is explained in hardware and software terms. The hardware used to explain the application is the STARAN (Goodyear Aerospace Corporation) associative array processor. The software used is the APPLE (Array Processor Programming Language) programming language. Some applications of the array processor are discussed. This summary tries to differentiate between the techniques of the sequential machine and the associative array processor.

  19. Global Arrays Parallel Programming Toolkit

    SciTech Connect

    Nieplocha, Jaroslaw; Krishnan, Manoj Kumar; Palmer, Bruce J.; Tipparaju, Vinod; Harrison, Robert J.; Chavarría-Miranda, Daniel

    2011-01-01

    The two predominant classes of programming models for parallel computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. The shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing. However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC, Titanium, and, to a lesser extent, Co-Array Fortran. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF, ZPL, and Data Parallel C. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving
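
    The idea of moving data explicitly between a global index space and local storage can be illustrated with a toy, single-process model. The class and method names below (ToyGlobalArray, get, put, owner) are hypothetical and are not the Global Arrays API; the "owners" are just Python lists standing in for per-process memory.

```python
class ToyGlobalArray:
    """Hypothetical stand-in for a distributed 1-D array; NOT the Global Arrays API."""

    def __init__(self, length, n_owners):
        self.length = length
        self.chunk = -(-length // n_owners)  # ceiling division: elements per owner
        # One block of storage per owner; the last block may be shorter.
        self.blocks = [[0.0] * max(0, min(self.chunk, length - p * self.chunk))
                       for p in range(n_owners)]

    def owner(self, index):
        """Which owner holds a given global index (the locality information)."""
        return index // self.chunk

    def get(self, lo, hi):
        """Copy global elements lo..hi-1 into a fresh local buffer."""
        return [self.blocks[i // self.chunk][i % self.chunk] for i in range(lo, hi)]

    def put(self, lo, values):
        """Copy a local buffer back into the global index range starting at lo."""
        for offset, value in enumerate(values):
            i = lo + offset
            self.blocks[i // self.chunk][i % self.chunk] = value

# Usage: fetch a block that spans two owners, work on it locally, write it back.
ga = ToyGlobalArray(10, 3)
ga.put(2, [1.0, 2.0, 3.0, 4.0])
local = ga.get(2, 6)
ga.put(2, [2 * v for v in local])
```

    The point of the sketch is only that every access to non-local data is a visible call, so the programmer can see and control where communication would occur.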

  20. Scan line graphics generation on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1988-01-01

    Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

  1. Image Rotation Correction With CORDIC Array Processor

    NASA Astrophysics Data System (ADS)

    Shyu, Keh-Hwa; Jeng, Bor-Shenn; Jou, I.-Chang; Ting, Pei-Yih

    1988-10-01

    In a document analysis or understanding system[1,2], rotation of the document's image causes optical character recognition errors, so the document must be scanned and recognized again. This degrades the performance of the automatic document input system. In this paper, we propose a method to estimate the unexpected rotation angle of the image, and we suggest using a pipelined CORDIC array processor architecture to rotate the image back quickly, thereby improving the performance of the automatic document input system.
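
    For readers unfamiliar with CORDIC, the rotation-mode iteration can be sketched as follows. This is a generic textbook form, not the authors' pipelined array design; it uses floating point for clarity, whereas hardware would replace the multiplications by powers of two with shifts. The function name and the iteration count are assumptions.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) counter-clockwise by `angle` radians (|angle| below about 1.74)."""
    # Elementary rotation angles atan(2^-i) and the accumulated CORDIC gain.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate toward zero residual
        shift = 2.0 ** -i             # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return x / gain, y / gain         # undo the constant scale factor
```

    Rotating the point (1, 0) by 30 degrees with this routine returns approximately (cos 30°, sin 30°). To undo an estimated document skew, each pixel coordinate would be rotated by the negative of the estimated angle.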

  2. Parallel processor for real-time structural control

    NASA Astrophysics Data System (ADS)

    Tise, Bert L.

    1993-07-01

    A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-to-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection to host computer, parallelizing code generator, and look-up tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An OpenWindows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.

  3. A systolic array parallelizing compiler

    SciTech Connect

    Tseng, P.S.

    1990-01-01

    This book presents a completely new approach to the problem of building a parallelizing compiler for systolic arrays. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler that can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strength of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code is given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

  4. Breadboard Signal Processor for Arraying DSN Antennas

    NASA Technical Reports Server (NTRS)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael; Rayhrer, Benno

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a following developed version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver
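
    The "FX" order of operations described above can be sketched for one antenna pair as follows. This is an illustrative toy, not the DSN processor: it uses a direct DFT instead of a fast Fourier transform for brevity, and it assumes the usual convention that one spectrum is complex-conjugated before the multiply. The function names and block size are hypothetical.

```python
import cmath

def dft(samples):
    """Direct discrete Fourier transform (stands in for the FFT of the 'F' step)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fx_correlate(sig_a, sig_b, block):
    """Accumulate the per-channel cross-power spectrum over consecutive blocks."""
    acc = [0j] * block
    usable = min(len(sig_a), len(sig_b))
    for start in range(0, usable - block + 1, block):
        fa = dft(sig_a[start:start + block])      # F: transform each signal
        fb = dft(sig_b[start:start + block])
        for k in range(block):
            acc[k] += fa[k] * fb[k].conjugate()   # X: multiply, then accumulate
    return acc
```

    The phase of each accumulated channel estimates the phase difference between the two inputs in that channel, which is the kind of quantity a calibration step could use to align signals in phase and delay.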

  5. Computing the Hough transform on a scan line array processor

    SciTech Connect

    Fisher, A.L.; Highnam, P.T.

    1989-03-01

    This paper describes a parallel algorithm for a line-finding Hough transform that runs on a linearly connected, SIMD vector of processors. The authors show that a high-precision transform, usually considered to be an expensive global operation, can be performed efficiently, in two to three times real time, with only local communication on a long vector. The algorithm also illustrates a decomposition principle that has wide application in algorithm design for large linear arrays. They include a review of straight-line Hough transform implementations.
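
    For orientation, the standard serial straight-line Hough transform that such parallel algorithms accelerate can be sketched as below. This is the textbook voting scheme, not the authors' decomposition for linear SIMD arrays; the function names, the one-degree angular resolution, and the one-pixel rho resolution are assumptions.

```python
import math

def hough_lines(points, width, height, n_theta=180):
    """Vote in (theta, rho) space for the lines x*cos(theta) + y*sin(theta) = rho."""
    max_rho = int(math.ceil(math.hypot(width, height)))
    # One row per angle; columns are offset by max_rho so negative rho values fit.
    acc = [[0] * (2 * max_rho + 1) for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[t][rho + max_rho] += 1
    return acc, max_rho

def strongest_line(acc, max_rho):
    """Return (theta in degrees, rho, votes) for the accumulator cell with most votes."""
    votes, t, r = max((acc[t][r], t, r)
                      for t in range(len(acc)) for r in range(len(acc[0])))
    return math.degrees(math.pi * t / len(acc)), r - max_rho, votes
```

    Every edge point votes once per angle, so the work is proportional to the number of points times the number of angles, which is why the transform is usually regarded as an expensive global operation.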

  6. High density packaging and interconnect of massively parallel image processors

    NASA Technical Reports Server (NTRS)

    Carson, John C.; Indin, Ronald J.

    1991-01-01

    This paper presents conceptual designs for high density packaging of parallel processing systems. The systems fall into two categories: global memory systems where many processors are packaged into a stack, and distributed memory systems where a single processor and many memory chips are packaged into a stack. Thermal behavior and performance are discussed.

  7. Chemical network problems solved on NASA/Goddard's massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Cho, Seog Y.; Carmichael, Gregory R.

    1987-01-01

    The single instruction stream, multiple data stream Massively Parallel Processor (MPP) unit consists of 16,384 bit serial arithmetic processors configured as a 128 x 128 array whose speed can exceed that of current supercomputers (Cyber 205). The applicability of the MPP for solving reaction network problems is presented and discussed, including the mapping of the calculation to the architecture, and CPU timing comparisons.

  8. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J.

    1990-01-01

    An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

  9. Multithreaded processor architecture for parallel symbolic computation. Technical report

    SciTech Connect

    Fujita, T.

    1987-09-01

    This paper describes the Multilisp Architecture for Symbolic Applications (MASA), which is a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution. The principal mechanisms exploited for this processor are multiple contexts, interleaved pipeline execution from separate instruction streams, and synchronization based on a bit in each memory cell. The tagged architecture approach is taken for Lisp program execution, and trap conditions are provided for future object manipulation and garbage collection.

  10. Massively parallel MRI detector arrays

    NASA Astrophysics Data System (ADS)

    Keil, Boris; Wald, Lawrence L.

    2013-04-01

    Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas via reception, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays.

  11. Massively Parallel MRI Detector Arrays

    PubMed Central

    Keil, Boris; Wald, Lawrence L

    2013-01-01

    Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

  12. Massively parallel MRI detector arrays.

    PubMed

    Keil, Boris; Wald, Lawrence L

    2013-04-01

    Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas via reception, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called "ultimate" SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

  13. Singular value decomposition utilizing parallel algorithms on graphical processors

    SciTech Connect

    Kotas, Charlotte W; Barhen, Jacob

    2011-01-01

    One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, $C_x = \frac{1}{K}\sum_{k=1}^{K} X(k) X^H(k)$, where $X(k)$ denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix $A$ involves determining $V$, $\Sigma$, and $U$ such that $A = U \Sigma V^H$, where $U$ and $V$ are orthonormal and $\Sigma$ is a positive, real, diagonal matrix containing the singular values of $A$. $U$ and $V$ are the eigenvectors of $AA^H$ and $A^HA$, respectively, while the singular values are the square roots of the eigenvalues of $AA^H$.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step algorithm which bidiagonalizes the matrix using Householder
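
    The link between the SVD of the data and the eigen-decomposition of the sample spectral matrix can be written out in a few lines. The derivation below is a sketch under one assumption not stated in the abstract: the K snapshots are stacked as the columns of a single data matrix, here called A.

```latex
% Assumption: the snapshots form the columns of the data matrix A.
A = \bigl[\, X(1) \;\; X(2) \;\; \cdots \;\; X(K) \,\bigr]
\quad\Longrightarrow\quad
C_x = \frac{1}{K}\sum_{k=1}^{K} X(k)\,X^H(k) = \frac{1}{K}\,A A^H .

% Substituting the SVD A = U \Sigma V^H and using V^H V = I:
C_x = \frac{1}{K}\, U \Sigma V^H V \Sigma^H U^H
    = U \left( \frac{1}{K}\,\Sigma \Sigma^H \right) U^H .

% Hence the columns of U are the eigenvectors of C_x, with eigenvalues
\lambda_i = \frac{\sigma_i^2}{K}, \qquad i = 1, 2, \ldots
```

    Under that assumption, the eigenvectors and eigenvalues needed for whitening follow from the left singular vectors and singular values of the data matrix, so the product A A^H never has to be formed explicitly.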

  14. CMOS processor element for a fault-tolerant SVD array

    NASA Astrophysics Data System (ADS)

    Kota, Kishore; Cavallaro, Joseph R.

    1993-11-01

    This paper describes the VLSI implementation of a CORDIC based processor element for use in a fault-reconfigurable systolic array to compute the singular value decomposition (SVD) of a matrix. The chip implements a time redundant fault tolerance scheme, which allows processors adjacent to a faulty processor to act as computation backup during the systolic idle time. Also, processors around a fault collaborate to reroute data around the faulty processor. This form of time redundancy is attractive when tolerance to a few faults needs to be achieved with little hardware overhead.

  15. Global synchronization of parallel processors using clock pulse width modulation

    SciTech Connect

    Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

    2013-04-02

    A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and perform a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.

  16. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  17. Staging memory for massively parallel processor

    NASA Technical Reports Server (NTRS)

    Batcher, Kenneth E. (Inventor)

    1988-01-01

    The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

  18. An informal introduction to parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. During that time I came in contact with some developments in computer science that were unfamiliar to me as a mathematician. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to parallel processing. This paper is not meant to be a full explanation of the topic, but an informal introduction for the "mathematical layman."

  19. Linear-time algorithms for scheduling on parallel processors

    SciTech Connect

    Monma, C.L.

    1982-01-01

    Linear-time algorithms are presented for several problems of scheduling n equal-length tasks on m identical parallel processors subject to precedence constraints. This improves upon previous time bounds for the maximum lateness problem with treelike precedence constraints, the number-of-late-tasks problem without precedence constraints, and the one machine maximum lateness problem with general precedence constraints. 5 references.

  20. Dynamic overset grid communication on distributed memory parallel processors

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  1. Experience with a multiprocessor based on eight FPS 120B array processors

    SciTech Connect

    Bucher, I.Y.; Frederickson, P.O.; Moore, J.W.

    1981-01-01

    The rate of increase in the speed of monoprocessors is no longer keeping pace with the needs of the laboratory; accordingly, the use of parallel processors in large scientific computations is being investigated. As an initial experiment, a particle-in-cell plasma simulation was adapted to run on a star graph architecture consisting of a UNIVAC 1110 as hub, and up to eight Floating Point Systems AP120B array processors at the other vertices. Subdivision of tasks among processors and measured results are discussed.

  2. An Evaluation of Document Retrieval from Serial Files Using the ICL Distributed Array Processor.

    ERIC Educational Resources Information Center

    Pogue, Christine; Willett, Peter

    1984-01-01

    Describes preliminary investigation of the use of International Computers Limited's Distributed Array Processor (DAP) for parallel searching of large serial files of documents. DAP hardware and software, test collections, measurement of DAP performance, search algorithms, experimental results, and DAP suitability for interactive searching are…

  3. Real-time trajectory optimization on parallel processors

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.

    1993-01-01

    A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable for real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an Intel iPSC/860 message-passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on three example problems: the Goddard problem; the acceleration-limited, planar minimum-time-to-the-origin problem; and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.
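
    The speed-up figure quoted in this record can be put in context with a short calculation. The sketch below is not taken from the report: the function names are hypothetical, and the Karp-Flatt serial fraction is a standard metric added here only for illustration. The only inputs from the abstract are the speed-up of 7.2 and the 32 nodes.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors


def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


# Figures quoted in the abstract: speed-up of 7.2 on 32 nodes.
efficiency = parallel_efficiency(7.2, 32)
serial_fraction = karp_flatt_serial_fraction(7.2, 32)
print(f"efficiency = {efficiency:.3f}")
print(f"Karp-Flatt serial fraction = {serial_fraction:.3f}")
```

    Under these definitions the 64-stage Goddard run corresponds to roughly 22.5 percent parallel efficiency, which suggests that communication or serial work dominated at that problem size.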

  4. Automatic generation of synchronization instructions for parallel processors

    SciTech Connect

    Midkiff, S.P.

    1986-05-01

    The development of high speed parallel multi-processors, capable of parallel execution of doacross and forall loops, has stimulated the development of compilers to transform serial FORTRAN programs to parallel forms. One of the duties of such a compiler must be to place synchronization instructions in the parallel version of the program to ensure the legal execution order of doacross and forall loops. This thesis gives strategies usable by a compiler to generate these synchronization instructions. It presents algorithms for reducing the parallelism in FORTRAN programs to match a target architecture, recovering some of the parallelism so discarded, and reducing the number of synchronization instructions that must be added to a FORTRAN program, as well as basic strategies for placing synchronization instructions. These algorithms are developed for two synchronization instruction sets. 20 refs., 56 figs.

  5. Potential of minicomputer/array-processor system for nonlinear finite-element analysis

    NASA Technical Reports Server (NTRS)

    Strohkorb, G. A.; Noor, A. K.

    1983-01-01

    The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.

  6. Finding maximum on an array processor with a global bus

    NASA Technical Reports Server (NTRS)

    Bokhari, S. H.

    1984-01-01

    The problem of finding the maximum of a set of values stored one per processor on an n x n array of processors is analyzed. The array has a time-shared global bus in addition to conventional processor-processor links. A two-phase algorithm for finding the maximum is presented that uses conventional links during the first phase and the global bus during the second. This algorithm is faster than algorithms that use either only the global bus or only the conventional links. Two types of interconnection patterns (the eight-nearest-neighbor and the four-nearest-neighbor) are analyzed. In both cases it is shown that the time required to find the maximum using the two-phase algorithm is O(n^(2/3)), assuming the propagation speed of the global bus to be a constant independent of the size of the array. In the case where the propagation speed is logarithmic in the number of processors, the time to find the maximum is O((n^2 log n)^(1/3)), for both types of arrays. Extensions to q-dimensional arrays show that the two-phase algorithm is superior for any fixed value of q.
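
    The two-phase idea can be sketched with a simple cost model. The code below is an illustration under stated assumptions, not the analysis from the report: the function names, the block decomposition into k x k tiles, the charge of 2(k - 1) link steps per tile, and the charge of one bus cycle per tile maximum are all assumptions chosen to show why a tile side near n^(2/3) balances the two phases.

```python
import random


def two_phase_max(grid, block):
    """Return (maximum, modeled_time) for an n x n grid using block x block tiles.

    Phase 1: each tile finds its own maximum over local links
             (modeled as 2 * (block - 1) neighbor-to-neighbor steps).
    Phase 2: the tile maxima are sent one at a time over the shared bus
             (modeled as one unit-time bus cycle per tile).
    """
    n = len(grid)
    tile_maxima = []
    for r0 in range(0, n, block):
        for c0 in range(0, n, block):
            tile_maxima.append(
                max(grid[r][c]
                    for r in range(r0, min(r0 + block, n))
                    for c in range(c0, min(c0 + block, n)))
            )
    phase1_time = 2 * (block - 1)
    phase2_time = len(tile_maxima)
    return max(tile_maxima), phase1_time + phase2_time


def best_block_size(n):
    """Tile side minimizing the modeled time, found by direct search."""
    dummy = [[0] * n for _ in range(n)]
    return min(range(1, n + 1), key=lambda k: two_phase_max(dummy, k)[1])


random.seed(0)
n = 64
grid = [[random.random() for _ in range(n)] for _ in range(n)]
k = best_block_size(n)
value, t_two_phase = two_phase_max(grid, k)
_, t_bus_only = two_phase_max(grid, 1)      # every value goes over the bus
_, t_links_only = two_phase_max(grid, n)    # one tile, links only
print(k, round(n ** (2 / 3)), t_two_phase, t_bus_only, t_links_only)
```

    In this model the total time is about 2k + (n/k)^2, which is minimized when k is close to n^(2/3); both degenerate choices (k = 1, bus only; k = n, links only) are slower.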

  7. Fabrication of fault-tolerant systolic array processors

    SciTech Connect

    Golovko, V.A.

    1995-05-01

    Methods for designing fault-tolerant systolic array processors are discussed. Several ways of bypassing faulty elements in configurations, which depend on an input-data flow organization, are suggested. An analysis of the additional hardware costs of providing fault tolerance by various techniques and for various levels of redundancy is presented. Hadamard fault-tolerant processor design was used to illustrate the efficiency of the techniques suggested.

  8. A Josephson systolic array processor for multiplication/addition operations

    SciTech Connect

    Morisue, M.; Li, F.Q.; Tobita, M.; Kaneko, S.

    1991-03-01

    A novel Josephson systolic array processor to perform multiplication/addition operations is proposed. The systolic array processor proposed here consists of a set of three kinds of interconnected cells whose main circuits are built from SQUID gates. A multiplication of 2 bits by 2 bits is performed in a single cell at a time, and an addition of three two-bit data is simultaneously performed in another type of cell. Furthermore, information in this system flows between cells in a pipeline fashion so that high performance can be achieved. In this paper the principle of the Josephson systolic array processor is described in detail and simulation results are illustrated for the multiplication/addition of (4 bits × 4 bits + 8 bits). The results show that these operations can be executed in 330 ps.

  9. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    DOEpatents

    Chatterjee, Siddhartha; Gunnels, John A.

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

  10. Pipelining multiple singular value decomposition (SVDs) on a single processor array

    NASA Astrophysics Data System (ADS)

    Kota, Kishore; Cavallaro, Joseph R.

    1994-10-01

    We present a new family of architectures for processor arrays to implement Jacobi SVD which allow systolic loading and unloading of input and result matrices. Unlike most of the previous SVD arrays in the literature, our architectures do not require special handling of external I/O and hence are closer to the traditional concept of systolic architectures. The boundary processors communicate with the host the same way any of the interior processors communicate with their neighbors. The arrays are surprisingly uniform and simple. The various architectures in the family represent different throughput-hardware tradeoffs corresponding to the degree to which the multiple sweeps have been unrolled and determine the number of independent SVDs which may be pipelined on the array. We achieved systolic loading by using the flexibility provided by the cyclic Jacobi method on the order in which pivot pairs may be chosen. The array operates on the matrix data even as it is being loaded. Once the pipeline is full, the ordering is very similar to odd-even ordering. Our ordering is equivalent to cyclic-by-rows ordering and hence the algorithm is guaranteed to converge. Our systolic loading scheme is very important in an I/O limited system, since it allows more communication to occur in parallel, where the communication includes the loading and unloading operations. The array with the highest throughput in our family of architectures, which implement one-sided Jacobi (either Hestenes' method or Eberlein and Park's method), is a linear array of processors with unidirectional links between neighbors. The architectures with lower throughput require fewer processors connected in a ring, allowing data to recirculate among the processors. The input matrix is loaded one column at a time from the left and the results stream out one column at a time from the right.
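
    For readers unfamiliar with the underlying numerical method, the following is a minimal sequential sketch of one-sided Jacobi (Hestenes) rotations with a cyclic-by-rows pivot ordering. It is not the systolic architecture described in this record: the function name, tolerance, and sweep limit are assumptions, and the code only illustrates the column-pair rotations that such arrays distribute across processors.

```python
import math


def hestenes_singular_values(a, tol=1e-12, max_sweeps=60):
    """Singular values of a (list of rows) via one-sided Jacobi rotations."""
    u = [row[:] for row in a]          # work on a copy of the input
    rows, cols = len(u), len(u[0])
    for _ in range(max_sweeps):
        rotated = False
        # Cyclic-by-rows ordering of pivot pairs (p, q) with p < q.
        for p in range(cols - 1):
            for q in range(p + 1, cols):
                alpha = sum(u[i][p] ** 2 for i in range(rows))
                beta = sum(u[i][q] ** 2 for i in range(rows))
                gamma = sum(u[i][p] * u[i][q] for i in range(rows))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue           # columns p and q already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Rotate the two columns so that they become orthogonal.
                for i in range(rows):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break                      # a full sweep made no rotation
    # After convergence the singular values are the column norms.
    norms = [math.sqrt(sum(u[i][j] ** 2 for i in range(rows)))
             for j in range(cols)]
    return sorted(norms, reverse=True)


sv = hestenes_singular_values([[1.0, 2.0], [3.0, 4.0]])
print(sv)
```

    Because each rotation touches only two columns, disjoint pivot pairs can be processed at the same time, which is the property that processor arrays of this kind exploit.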

  11. Ring-array processor distribution topology for optical interconnects

    NASA Technical Reports Server (NTRS)

    Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

    1992-01-01

    The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

  12. Feasibility of optically interconnected parallel processors using wavelength division multiplexing

    SciTech Connect

    Deri, R.J.; De Groot, A.J.; Haigh, R.E.

    1996-03-01

    New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to approximately 150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.

  13. Analog parallel processor hardware for high speed pattern recognition

    NASA Technical Reports Server (NTRS)

    Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

    1990-01-01

    A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 microsec. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.
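
    The matching scheme described here (inner products followed by winner-take-all selection) can be mimicked in software. The sketch below is a digital illustration only, assuming bipolar (+1/-1) patterns; the function names and the example library are hypothetical and do not come from the report, whose hardware performs these steps in analog circuitry.

```python
def inner_product(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def winner_take_all(stored_patterns, input_pattern):
    """Return (index of best-matching stored pattern, all match scores)."""
    scores = [inner_product(p, input_pattern) for p in stored_patterns]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


# Hypothetical library of three bipolar patterns.
library = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
# Input: the second pattern with its first element corrupted.
probe = [-1, -1, 1, -1, 1, -1, 1, -1]

winner, scores = winner_take_all(library, probe)
print(winner, scores)
```

    The corrupted input still selects the second stored pattern, because its inner product with that pattern remains the largest of the three.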

  14. Optimal mapping of irregular finite element domains to parallel processors

    NASA Technical Reports Server (NTRS)

    Flower, J.; Otto, S.; Salama, M.

    1987-01-01

    A mapping of the solution domain of n finite elements into N subdomains that may be processed in parallel by N processors is optimal if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
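
    A generic simulated-annealing mapper gives a feel for the approach. The sketch below is not the authors' formulation: the cost function (sum of squared processor loads plus a weighted count of cut edges), the linear cooling schedule, the function names, and the small regular 4 x 4 test mesh are all assumptions made for illustration.

```python
import math
import random


def mapping_cost(assign, edges, n_procs, comm_weight=1.0):
    """Load imbalance (sum of squared loads) plus weighted count of cut edges."""
    loads = [0] * n_procs
    for proc in assign:
        loads[proc] += 1
    cut = sum(1 for a, b in edges if assign[a] != assign[b])
    return sum(load * load for load in loads) + comm_weight * cut


def anneal_mapping(n_elems, edges, n_procs, steps=5000, t0=2.0, seed=0):
    """Search for a low-cost element-to-processor assignment."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_procs) for _ in range(n_elems)]
    cost = mapping_cost(assign, edges, n_procs)
    best, best_cost = assign[:], cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        elem = rng.randrange(n_elems)
        old = assign[elem]
        assign[elem] = rng.randrange(n_procs)      # propose moving one element
        new_cost = mapping_cost(assign, edges, n_procs)
        delta = new_cost - cost
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = assign[:], cost
        else:
            assign[elem] = old                     # reject the move
    return best, best_cost


# Hypothetical test mesh: 4 x 4 grid of elements, edges between neighbors.
side = 4
edges = []
for r in range(side):
    for c in range(side):
        e = r * side + c
        if c + 1 < side:
            edges.append((e, e + 1))
        if r + 1 < side:
            edges.append((e, e + side))

best, best_cost = anneal_mapping(side * side, edges, n_procs=4)
print(best, best_cost)
```

    The squared-load term rewards balanced subdomains while the cut-edge term penalizes interprocessor communication; the relative weight between the two is a tuning choice in this sketch.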

  15. Guidelines for efficient use of optical systolic array processors

    SciTech Connect

    Casasent, D.

    1983-01-01

    The design, error analysis, component accuracy required, computational capacity, data flow and pipelining, plus the algorithm and application all seriously impact the use of optical systolic array processors. The author provides initial remarks, results, examples and solutions for each of these issues. 20 references.

  16. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  17. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1990-01-01

    Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.

  18. Particle simulation of plasmas on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Gledhill, I. M. A.; Storey, L. R. O.

    1987-01-01

    Particle simulations, in which collective phenomena in plasmas are studied by following the self consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.

  19. Frequency-multiplexed and pipelined iterative optical systolic array processors

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Jackson, J.; Neuman, C.

    1983-01-01

    Optical matrix processors using acoustooptic transducers are described, with emphasis on new systolic array architectures using frequency multiplexing in addition to space and time multiplexing. A Kalman filtering application is considered in a case study from which the operations required on such a system can be defined. This also serves as a new and powerful application for iterative optical processors. The importance of pipelining the data flow and the ordering of the operations performed in a specific application of such a system are also noted. Several examples of how to effectively achieve this are included. A new technique for handling bipolar data on such architectures is also described.

  20. VLSI array processor R&D status report

    NASA Astrophysics Data System (ADS)

    Greenwood, E.

    1982-01-01

    Detail design of the Arithmetic Processor Unit (APU) chip has been completed. All cell types (100) have been run through the design rule check (DRC) programs, corrected and verified. DRC runs on the entire chip have been completed and all corrections have been made. Fifteen out of eighteen of the chip DRC corrections have been verified. The metal, polysilicon and information data layers of the APU layout are shown. The attached drawing, titled 'VLSI Array Processor Arithmetic Processor Unit Chip Plan', is a detail drawing of the APU Chip Plan. The functional level simulator of the APU has been built and verified using a set of APU diagnostic code. A gate level logic simulation of the APU has been built. The APU breadboard modules have been fabricated and check out has been initiated. The Array Processor Demonstration System (APDS) modules are in the wire-wrap process. The APDS and APU microcode assembler have been built and checked out. The linker and loader for the APDS have also been built.

  1. Digital signal processor and programming system for parallel signal processing

    SciTech Connect

    Van den Bout, D.E.

    1987-01-01

    This thesis describes an integrated assault upon the problem of designing high-throughput, low-cost digital signal-processing systems. The dual prongs of this assault consist of: (1) the design of a digital signal processor (DSP) which efficiently executes signal-processing algorithms in either a uniprocessor or multiprocessor configuration, (2) the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level signal-processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an interprocessor communications port (IPC) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multi-processor modes of operation. The PaLS programming system simplifies the task of parallelizing an algorithm for execution upon a multiprocessor built with the DSP.

  2. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

  3. On program restructuring, scheduling, and communication for parallel processor systems

    SciTech Connect

    Polychronopoulos, Constantine D.

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler, was used to transform programs into a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.

  4. Beam dynamics calculations and particle tracking using massively parallel processors

    SciTech Connect

    Ryne, R.D.; Habib, S.

    1995-12-31

    During the past decade massively parallel processors (MPPs) have slowly gained acceptance within the scientific community. At present these machines typically contain a few hundred to one thousand off-the-shelf microprocessors and a total memory of up to 32 GBytes. The potential performance of these machines is illustrated by the fact that a month-long job on a high-end workstation might require only a few hours on an MPP. The acceptance of MPPs has been slow for a variety of reasons. For example, some algorithms are not easily parallelizable. Also, in the past these machines were difficult to program. But in recent years the development of Fortran-like languages such as CM Fortran and High Performance Fortran has made MPPs much easier to use. In the following we will describe how MPPs can be used for beam dynamics calculations and long-term particle tracking.

  5. Semantic network array processor and its applications to image understanding

    SciTech Connect

    Dixit, V.; Moldovan, D.I.

    1987-01-01

    The problems in computer vision range from edge detection and segmentation at the lowest level to the problem of cognition at the highest level. This correspondence describes the organization and operation of a semantic network array processor (SNAP) as applicable to high level computer vision problems. The architecture consists of an array of identical cells each containing a content addressable memory, microprogram control, and a communication unit. The applications discussed in this paper are the two general techniques, discrete relaxation and dynamic programming. While the discrete relaxation is discussed with reference to scene labeling and edge interpretation, the dynamic programming is tuned for stereo.

  6. The performance realities of massively parallel processors: A case study

    SciTech Connect

    Lubeck, O.M.; Simmons, M.L.; Wasserman, H.J.

    1992-07-01

    This paper presents the results of an architectural comparison of SIMD massive parallelism, as implemented in the Thinking Machines Corp. CM-2 computer, and vector or concurrent-vector processing, as implemented in the Cray Research Inc. Y-MP/8. The comparison is based primarily upon three application codes that represent Los Alamos production computing. Tests were run by porting optimized CM Fortran codes to the Y-MP, so that the same level of optimization was obtained on both machines. The results for fully-configured systems, using measured data rather than scaled data from smaller configurations, show that the Y-MP/8 is faster than the 64k CM-2 for all three codes. A simple model that accounts for the relative characteristic computational speeds of the two machines, and reduction in overall CM-2 performance due to communication or SIMD conditional execution, is included. The model predicts the performance of two codes well, but fails for the third code, because the proportion of communications in this code is very high. Other factors, such as memory bandwidth and compiler effects, are also discussed. Finally, the paper attempts to show the equivalence of the CM-2 and Y-MP programming models, and also comments on selected future massively parallel processor designs.

  7. Solution of large linear systems of equations on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Ida, Nathan; Udawatta, Kapila

    1987-01-01

    The Massively Parallel Processor (MPP) was designed as a special machine for specific applications in image processing. As a parallel machine with a large number of processors that can be reconfigured in different combinations, it is also applicable to other problems that require a large number of processors. The solution of linear systems of equations on the MPP is investigated. The solution times achieved are compared to those obtained with a serial machine and the performance of the MPP is discussed.

  8. Optimal evaluation of array expressions on massively parallel machines

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

    1992-01-01

    We investigate the problem of evaluating FORTRAN 90 style array expressions on massively parallel distributed-memory machines. On such machines, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression. The choice of where to perform the operation then affects this cost. We present algorithms based on dynamic programming to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider expressions containing operations that change the shape of the arrays, and show that our approach extends naturally to handle this case.
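
    The dynamic-programming idea in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional ring of offsets, a binary expression tree, and a cost of one unit per hop; the names `ring_distance`, `min_cost_table`, and `best_cost` are hypothetical and do not come from the paper, whose cost model and algorithms are more general.

```python
from typing import Dict, List, Tuple, Union

# An expression is a leaf (array name) or a tuple (op, left, right).
Expr = Union[str, Tuple[str, "Expr", "Expr"]]


def ring_distance(p: int, q: int, n: int) -> int:
    """Hop count between offsets p and q on a ring of n positions."""
    d = abs(p - q) % n
    return min(d, n - d)


def min_cost_table(expr: Expr, home: Dict[str, int], n: int) -> List[int]:
    """cost[p] = cheapest way to make the value of expr available at offset p."""
    if isinstance(expr, str):
        # A leaf array lives at a fixed offset; moving it to p costs the distance.
        return [ring_distance(home[expr], p, n) for p in range(n)]
    _, left, right = expr
    lc = min_cost_table(left, home, n)
    rc = min_cost_table(right, home, n)
    # Perform the elementwise operation at q (both operands aligned at q) ...
    at = [lc[q] + rc[q] for q in range(n)]
    # ... then optionally shift the result from q to p.
    return [min(at[q] + ring_distance(q, p, n) for q in range(n)) for p in range(n)]


def best_cost(expr: Expr, home: Dict[str, int], n: int) -> int:
    """Minimum total alignment cost over all result positions."""
    return min(min_cost_table(expr, home, n))
```

    For example, with arrays A, B, C at offsets 0, 2, 2 on a ring of 8, evaluating (A + B) + C is cheapest when A is shifted to offset 2, for a total cost of 2 in this toy model.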

  9. Massively parallel processor networks with optical express channels

    DOEpatents

    Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

    1999-08-24

    An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

  10. Massively parallel processor networks with optical express channels

    DOEpatents

    Deri, Robert J.; Brooks, III, Eugene D.; Haigh, Ronald E.; DeGroot, Anthony J.

    1999-01-01

    An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.

  11. Uncooled array IR sensors integrated with CCD analog processor

    NASA Astrophysics Data System (ADS)

    Chernokozhin, Vladimir V.; Pevtsov, Eugeny P.; Sigov, Alexander S.

    2002-04-01

    The technological methods proposed allow one to prepare integrated structures of multi-element heat detectors and a special CCD. In the thin structure, the sensitive film is isolated from the substrate by a supporting membrane or serves as the membrane itself. Such a technology seems advantageous for the further development of different MEMS structures. A completely monolithic pyroelectric array of sensors (100 × 100 µm²) has been created, based on a heat-sensitive film construction lifted slightly above the crystal, along with detector specimens with NETD less than 0.2 - 0.5 K (8 - 12 µm at 300 K and 20 - 50 Hz modulation frequency). The measurements and investigations allowed us to choose the structure of a 2D analogue CCD processor, which is now under design and will be integrated with the pyroelectric membrane array.

  12. Partitioning: An essential step in mapping algorithms into systolic array processors

    SciTech Connect

    Navarro, J.J.; Llaberia, J.M.; Valero, M.

    1987-07-01

    Many scientific and technical applications require high computing speed; those involving matrix computations are typical. For applications involving matrix computations, algorithmically specialized, high-performance, low-cost architectures have been conceived and implemented. Systolic array processors (SAPs) are a good example of these machines. An SAP is a regular array of simple processing elements (PEs) that have a nearest-neighbor interconnection pattern. The simplicity, modularity, and expandability of SAPs make them suitable for VLSI/WSI implementation. Algorithms that are efficiently executed on SAPs are called systolic algorithms (SAs). An SA uses an array of systolic cells whose parallel operations must be specified. When an SA is executed on an SAP, the specified computations of each cell are carried out by a PE of the SAP.
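
    As a rough illustration of a systolic algorithm on a nearest-neighbor PE array, the sketch below models the timing of an output-stationary square array computing a matrix product. It is an assumption-laden toy (square matrices, one multiply-accumulate per PE per time step, the hypothetical name `systolic_matmul`), not the partitioning method of the paper: PE(i, j) accumulates C[i][j], and the operands A[i][k] and B[k][j] reach it at step t = i + j + k because of the skewed input.

```python
def systolic_matmul(a, b):
    """Timing model of an n x n output-stationary systolic array.

    At time step t, PE(i, j) receives A[i][k] from its left neighbor and
    B[k][j] from its upper neighbor, where k = t - i - j, and accumulates
    their product into its local C[i][j].
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # The last operand pair reaches PE(n-1, n-1) at step 3n - 3.
    for t in range(3 * n - 2):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:  # this PE is active in this step
                    c[i][j] += a[i][k] * b[k][j]
    return c
```

    The model completes an n × n product in 3n - 2 steps with n² PEs, which is the usual motivation for mapping matrix computations onto such arrays.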

  13. On nonlinear finite element analysis in single-, multi- and parallel-processors

    NASA Technical Reports Server (NTRS)

    Utku, S.; Melosh, R.; Islam, M.; Salama, M.

    1982-01-01

    Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
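
    The statement that each Newton-Raphson step reduces to a linear problem can be seen in a one-degree-of-freedom sketch. The hardening spring law k·u + c·u³ and all names below are illustrative assumptions, not taken from the paper; in a finite element setting the scalar division becomes a linear solve with the tangent stiffness matrix.

```python
def newton_raphson(f_ext, k=10.0, c=2.0, tol=1e-10, max_iter=50):
    """Solve k*u + c*u**3 = f_ext for the displacement u.

    Each iteration solves the linearized problem
        tangent * du = residual
    which here is a scalar division.
    """
    u = 0.0
    for iteration in range(max_iter):
        residual = f_ext - (k * u + c * u ** 3)  # out-of-balance force
        if abs(residual) < tol:
            return u, iteration
        tangent = k + 3.0 * c * u ** 2           # tangent stiffness d(f_int)/du
        u += residual / tangent                  # the "linear problem" of this step
    raise RuntimeError("Newton-Raphson did not converge")
```

    With the default parameters, `newton_raphson(12.0)` converges to a displacement close to 1, since 10·1 + 2·1³ = 12.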

  14. On some parallel algorithms on a ring of processors

    NASA Astrophysics Data System (ADS)

    Sameh, A.

    1985-07-01

    In this paper we describe some linear algebra multiprocessor algorithms which are suitable for a ring of processors. These algorithms are organized in such a way as to be easily modified for general-purpose multiprocessors with shared global memories.

  15. A hierarchical, automated target recognition algorithm for a parallel analog processor

    NASA Technical Reports Server (NTRS)

    Woodward, Gail; Padgett, Curtis

    1997-01-01

    A hierarchical approach is described for an automated target recognition (ATR) system, VIGILANTE, that uses a massively parallel, analog processor (3DANN). The 3DANN processor is capable of performing 64 concurrent inner products of size 1x4096 every 250 nanoseconds.
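
    The quoted figures imply a raw throughput that is easy to work out. The sketch below simply counts one multiply-accumulate per vector element, which is an assumption about how to count operations rather than a figure stated in the abstract.

```python
def macs_per_second(inner_products=64, vector_length=4096, period_s=250e-9):
    """Multiply-accumulates per second implied by the quoted 3DANN figures."""
    macs_per_pass = inner_products * vector_length  # 64 * 4096 = 262144
    return macs_per_pass / period_s
```

    Under that counting, the rate is about 1.05 × 10¹² multiply-accumulates per second.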

  16. Fast neural net simulation with a DSP processor array.

    PubMed

    Muller, U A; Gunzinger, A; Guggenbuhl, W

    1995-01-01

    This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-b floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system with 3.8 Gflops peak performance consumes less than 800 W of electrical power and fits into a 19-in rack. While reaching the speed of modern supercomputers, MUSIC still can be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a computing performance to a single user which was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications. PMID:18263299

  17. Design of a dataway processor for a parallel image signal processing system

    NASA Astrophysics Data System (ADS)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

    1995-04-01

    Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor', designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology and comprises about 200 K gates.
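
    The link figures translate into a raw bandwidth estimate. The sketch assumes one 8-bit transfer per 50 MHz clock cycle with no protocol overhead, which the abstract does not state explicitly; the function name is hypothetical.

```python
def link_bandwidth_bytes_per_s(width_bits=8, clock_hz=50e6):
    """Raw one-direction bandwidth of a single link, assuming one transfer per clock."""
    return width_bits * clock_hz / 8


per_link = link_bandwidth_bytes_per_s()  # 50e6 bytes/s in each direction
all_links_one_way = 6 * per_link         # six Dataways: 300e6 bytes/s
all_links_duplex = 2 * all_links_one_way # full duplex: 600e6 bytes/s in total
```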

  18. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

    Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

  19. Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

    1997-01-01

    A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

  20. Periodic Application of Concurrent Error Detection in Processor Array Architectures. PhD. Thesis -

    NASA Technical Reports Server (NTRS)

    Chen, Paul Peichuan

    1993-01-01

    Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

  1. Some parallel algorithms on the four processor Cray X-MP4 supercomputer

    SciTech Connect

    Kincaid, D.R.; Oppe, T.C.

    1988-05-01

    Three numerical studies of parallel algorithms on a four processor Cray X-MP4 supercomputer are presented. These numerical experiments involve the following: a parallel version of ITPACKV 2C, a package for solving large sparse linear systems, a parallel version of the conjugate gradient method with line Jacobi preconditioning, and several parallel algorithms for computing the LU-factorization of dense matrices. 27 refs., 4 tabs.
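
    For readers unfamiliar with the second experiment, the following is a minimal serial sketch of the preconditioned conjugate gradient method. It uses point (diagonal) Jacobi preconditioning as a simplifying assumption; the line Jacobi variant named in the abstract works with blocks rather than single diagonal entries, and nothing here reflects the authors' parallel implementation.

```python
def jacobi_pcg(a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (list of lists)."""
    n = len(b)

    def matvec(v):
        return [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    x = [0.0] * n
    r = list(b)                                # residual b - A x for x = 0
    z = [r[i] / a[i][i] for i in range(n)]     # apply M^-1 = diag(A)^-1
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rz / dot(p, ap)                # step length along p
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [r[i] / a[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]  # new search direction
        rz = rz_new
    return x
```

    The matrix-vector product and the dot products are the parts that parallelize naturally, which is why this method is a common multiprocessor benchmark.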

  2. An iterative expanding and shrinking process for processor allocation in mixed-parallel workflow scheduling.

    PubMed

    Huang, Kuo-Chan; Wu, Wei-Ya; Wang, Feng-Jian; Liu, Hsiao-Ching; Hung, Chun-Hao

    2016-01-01

    Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelisms, i.e. mixed-parallel workflows, to solve large computational problems can get better efficacy compared with either pure task parallelism or pure data parallelism. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with an additional challenging issue of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of moldable tasks, called M-task, and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocation. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes of the same layer in a workflow might have unequal workloads. PMID:27504236

  3. Track recognition in 4 µs by a systolic trigger processor using a parallel Hough transform

    SciTech Connect

    Klefenz, F.; Noffz, K.H.; Conen, W.; Zoz, R.; Kugel, A. (Lehrstuhl fuer Informatik V); Maenner, R. (Lehrstuhl fuer Informatik V; Univ. Heidelberg, Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen)

    1993-08-01

    A parallel Hough transform processor has been developed that identifies circular particle tracks in a 2D projection of the OPAL jet chamber. The high-speed requirements imposed by the 8 bunch crossing mode of LEP could be fulfilled by computing the starting angle and the radius of curvature for each well defined track in less than 4 µs. The system consists of a Hough transform processor that determines well defined tracks, and an Euler processor that counts their number by applying the Euler relation to the thresholded result of the Hough transform. A prototype of a systolic processor has been built that handles one sector of the jet chamber. It consists of 35 × 32 processing elements that were loaded into 21 programmable gate arrays (XILINX). This processor runs at a clock rate of 40 MHz. It has been tested off-line with about 1,000 original OPAL events. No deviations from the off-line simulation have been found. A trigger efficiency of 93% has been obtained. The prototype together with the associated drift time measurement unit has been installed at the OPAL detector at LEP and 100k events have been sampled to evaluate the system under detector conditions.
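
    A software sketch of a Hough transform for circular tracks may help make the idea concrete. It assumes tracks are circles through the origin, so that a hit at polar coordinates (r, θ) on a track with starting angle φ and curvature κ = 1/R satisfies r = (2/κ)·sin(θ − φ). The bin counts, the curvature range, and the function names are illustrative choices, not parameters of the OPAL hardware.

```python
import math


def make_track_hits(phi0, kappa0, deltas):
    """Hits (r, theta) on a circle through the origin with start angle phi0, curvature kappa0."""
    return [(2.0 * math.sin(d) / kappa0, phi0 + d) for d in deltas]


def hough_peak(hits, n_phi=180, n_kappa=100, kappa_max=2.0):
    """Vote in (phi, kappa) space and return (phi, kappa, votes) of the strongest cell."""
    acc = [[0] * n_kappa for _ in range(n_phi)]
    for r, theta in hits:
        for i in range(n_phi):
            phi = 2.0 * math.pi * i / n_phi
            # Curvature a track with start angle phi would need to pass through this hit.
            kappa = 2.0 * math.sin(theta - phi) / r
            if 0.0 <= kappa < kappa_max:
                j = min(int(kappa / kappa_max * n_kappa), n_kappa - 1)
                acc[i][j] += 1
    votes, i, j = max((acc[i][j], i, j) for i in range(n_phi) for j in range(n_kappa))
    # Report the phi bin edge used for voting and the center of the kappa bin.
    return 2.0 * math.pi * i / n_phi, (j + 0.5) * kappa_max / n_kappa, votes
```

    Every hit votes independently for each φ bin, so the inner loop maps naturally onto an array of processing elements working in parallel, one per accumulator cell.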

  4. Using algebra for massively parallel processor design and utilization

    NASA Technical Reports Server (NTRS)

    Campbell, Lowell; Fellows, Michael R.

    1990-01-01

    This paper summarizes the authors' advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.
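
    The problem of packing as many nodes as possible into a network of given degree and diameter has a standard upper limit, the Moore bound, which gives a yardstick for how dense any construction can be. The helper below is a generic illustration and is not taken from the paper.

```python
def moore_bound(degree, diameter):
    """Upper bound on the node count of a graph with the given maximum degree and diameter."""
    if degree < 1 or diameter < 0:
        raise ValueError("degree must be >= 1 and diameter >= 0")
    # One root, then degree * (degree - 1)**i new nodes at distance i + 1.
    return 1 + degree * sum((degree - 1) ** i for i in range(diameter))
```

    For degree 3 and diameter 2 the bound is 10, which the Petersen graph attains; for most other parameter pairs no graph reaches the bound.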

  5. High speed vision processor with reconfigurable processing element array based on full-custom distributed memory

    NASA Astrophysics Data System (ADS)

    Chen, Zhe; Yang, Jie; Shi, Cong; Qin, Qi; Liu, Liyuan; Wu, Nanjian

    2016-04-01

    In this paper, a hybrid vision processor based on a compact full-custom distributed memory for near-sensor high-speed image processing is proposed. The proposed processor consists of a reconfigurable processing element (PE) array, a row processor (RP) array, and a dual-core microprocessor. The PE array includes two-dimensional processing elements with a compact full-custom distributed memory. It supports real-time reconfiguration between the PE array and the self-organized map (SOM) neural network. The vision processor is fabricated using a 0.18 µm CMOS technology. The circuit area of the distributed memory is reduced markedly to 1/3 of that of the conventional memory, so that the circuit area of the vision processor is reduced by 44.2%. Experimental results demonstrate that the proposed design achieves correct functions.

  6. A garbage collection algorithm for shared memory parallel processors

    SciTech Connect

    Crammond, J.

    1988-12-01

    This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.

  7. Parallel transport gates in a mixed-species ion trap processor

    NASA Astrophysics Data System (ADS)

    Home, Jonathan

    Scaled up quantum information processors will require large numbers of parallel gate operations. For ion trap quantum processing, a promising approach is to perform these operations in separated regions of a multi-zone processing chip between which quantum information is transported either by distributed photonic entanglement or by deterministic shuttling of the ions through the array. However scaling the technology for controlling pulsed laser beams which address each of multiple regions appears challenging. I will describe recent work on the control of both beryllium and calcium ions by transporting ions through static laser beams. We have demonstrated both parallel individually addressed operations as well as sequences of operations. Work is in progress towards multi-qubit gates, which requires good control of the ion transport velocity. We have developed a number of techniques for measuring and optimizing velocities in our trap, enabling significant improvements in performance. In addition to direct results, I will give an overview of our multi-species apparatus, including recent results on high fidelity multi-qubit gates. We are grateful for funding from the Swiss National Science Foundation and the ETH Zurich.

  8. Application of a floating point systems AP190L array processor to finite element analysis

    SciTech Connect

    Young, R.C.

    1982-04-01

    This report discusses the implementation of a finite element program on a Floating Point Systems AP190L array processor attached to a Univac 1182 host computer. The array processor was used to perform all calculations on the global system of linear equations including matrix assembly, matrix factoring and vector solution. A large scratch disk was attached directly to the array processor for storing the factored matrix. The remaining calculations, including data preparation, element matrix formation, stress integration and output display were performed by the host computer.

  9. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1989-01-01

    A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.

  10. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  11. Real-time tracking with a 3D-Flow processor array

    SciTech Connect

    Crosetto, D.

    1993-06-01

    To date, real-time track finding has been performed with CAM (Content Addressable Memories) or with fast coincidence logic, because the processing scheme was thought to have much slower performance. Advances in technology together with a new architectural approach make it feasible to also explore the computing technique for real-time track finding, which has the advantage over the CAM approach of allowing algorithms that can find more parameters, such as the sagitta, curvature, pt, etc. The report describes real-time track finding using a new computing approach based on the 3D-Flow array processor system. This system consists of a fixed interconnection architecture scheme, allowing flexible algorithm implementation on a scalable platform. The 3D-Flow parallel processing system for track finding is scalable in size and performance by increasing the number of processors, the speed, or the number of pipelined stages. The present article describes the conceptual idea and the design stage of the project.

  12. Preliminary study on the potential usefulness of array processor techniques for structural synthesis

    NASA Technical Reports Server (NTRS)

    Feeser, L. J.

    1980-01-01

    The effects of the use of array processor techniques within the structural analyzer program, SPAR, are simulated in order to evaluate the potential analysis speedups which may result. In particular, the connection of a Floating Point System AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data, estimates are made as to the relative speedups that can be executed in a more complete implementation on an array processor maxi-mini computer system.

  13. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

    1989-01-01

    The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.

  14. Coupled cluster algorithms for networks of shared memory parallel processors

    NASA Astrophysics Data System (ADS)

    Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.

    2007-05-01

    As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too increases the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS (General Atomic and Molecular Electronic Structure System) program suite [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] and the Distributed Data Interface (DDI) [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190]; however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.

  15. Reduction of solar vector magnetograph data using a microMSP array processor

    NASA Technical Reports Server (NTRS)

    Kineke, Jack

    1990-01-01

    The processing of raw data obtained by the solar vector magnetograph at NASA-Marshall requires extensive arithmetic operations on large arrays of real numbers. The objectives of this summer faculty fellowship study are to: (1) learn the programming language of the MicroMSP Array Processor and adapt some existing data reduction routines to exploit its capabilities; and (2) identify other applications and/or existing programs which lend themselves to array processor utilization which can be developed by undergraduate student programmers under the provisions of project JOVE.

  16. Numerically stable Jacobi array for parallel singular value decomposition (SVD) updating

    NASA Astrophysics Data System (ADS)

    Vanpoucke, Filiep J.; Moonen, Marc; Deprettere, Ed F. A.

    1994-10-01

    A novel algorithm is presented for updating the singular value decomposition in parallel. It is an improvement upon an earlier developed Jacobi-type SVD updating algorithm, where now the exact orthogonality of a certain matrix is guaranteed by means of a minimal factorization in terms of angles. Its orthogonality is known to be crucial for the numerical stability of the overall algorithm. The factored approach leads to a triangular array of rotation cells, implementing an orthogonal matrix-vector multiplication, and a novel array for SVD updating. Both arrays can be built up of CORDIC processors since the algorithms make exclusive use of orthogonal planar transformations.
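
    The benefit of storing an orthogonal matrix as a set of rotation angles can be shown with a small sketch. The representation below, a list of hypothetical `(i, j, angle)` plane rotations applied in sequence, is an illustrative assumption and not the authors' factorization or array layout. Because each factor is a rotation built from a single angle, the product stays orthogonal up to the rounding of each sine and cosine, and no matrix entries can drift.

```python
import math


def apply_rotations(rotations, v):
    """Multiply v by the orthogonal matrix given as a sequence of plane rotations."""
    out = list(v)
    for i, j, angle in rotations:
        c, s = math.cos(angle), math.sin(angle)
        # Rotate coordinates i and j; all other entries are untouched.
        out[i], out[j] = c * out[i] - s * out[j], s * out[i] + c * out[j]
    return out


def norm(v):
    return math.sqrt(sum(x * x for x in v))
```

    A rotation of an n-dimensional space needs at most n(n − 1)/2 such angles, which is consistent with the triangular arrangement of rotation cells mentioned in the abstract.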

  17. High-speed Systolic Array Processor (HISSAP) system development synopsis: Lesson learned. Final report, Oct 83-Oct 90

    SciTech Connect

    Loughlin, J.P.

    1991-05-01

    This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic and software mapping tools is described. Based on the experience gained in the mapping of matrix-based algorithms on the testbed hardware, specific recommendations are presented in the form of lessons learned, which are intended to offer guidance in the development of future Navy signal processing systems.

  18. Aligning parallel arrays to reduce communication

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

    1994-01-01

    Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

  19. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  20. Construction of a parallel processor for simulating manipulators and other mechanical systems

    NASA Technical Reports Server (NTRS)

    Hannauer, George

    1991-01-01

    This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

  1. Parallel fabrication of plasmonic nanocone sensing arrays.

    PubMed

    Horrer, Andreas; Schäfer, Christian; Broch, Katharina; Gollmer, Dominik A; Rogalski, Jan; Fulmes, Julia; Zhang, Dai; Meixner, Alfred J; Schreiber, Frank; Kern, Dieter P; Fleischer, Monika

    2013-12-01

    A fully parallel approach for the fabrication of arrays of metallic nanocones and triangular nanopyramids is presented. Different processes utilizing nanosphere lithography for the creation of etch masks are developed. Monolayers of spheres are reduced in size and directly used as masks, or mono- and double layers are employed as templates for the deposition of aluminum oxide masks. The masks are transferred into an underlying gold or silver layer by argon ion milling, which leads to nanocones or nanopyramids with very sharp tips. Near the tips the enhancement of an external electromagnetic field is particularly strong. This fact is confirmed by numerical simulations and by luminescence imaging in a confocal microscope. Such localized strong fields can be utilized, among other things, for high-resolution, high-sensitivity spectroscopy and sensing of molecules near the tip. Arrays of such plasmonic nanostructures thus constitute controllable platforms for surface-enhanced Raman spectroscopy. A thin film of pentacene molecules is evaporated onto both nanocone and nanopyramid substrates, and the observed Raman enhancement is evaluated. PMID:24302595

  2. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    SciTech Connect

    Cancelo, G.; Gottschalk, Erik Edward; Pavlicek, V.; Wang, M.; Wu, J.

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  3. An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

    SciTech Connect

    Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.; Catalyurek, Umit V.; Kurc, Tahsin; Sadayappan, Ponnuswamy; Saltz, Joel H.

    2009-08-01

    Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application-tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixed-parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks and the inter-task data communication volumes. A locality conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.

  4. Algorithm-Based Error Detection Of A Cholesky Factor Updating Systolic Array Using Cordic Processors

    NASA Astrophysics Data System (ADS)

    Chou, S. I.; Rader, Charles M.

    1989-12-01

    Lincoln Laboratory has developed an architecture for a folded linear systolic array using fixed-point CORDIC processors, applicable to adaptive nulling for a radar sidelobe canceler. The algorithm implemented uses triangularization by Givens rotations to solve a least-squares problem in the voltage domain. In this paper, the implementation of an inexpensive algorithm-based error-detection scheme is proposed for this systolic array. Column average checksum encoding is intended to detect most errors caused by the failure of any single arithmetic unit. It retains or almost retains the 100% processor utilization of Lincoln Laboratory's novel design. For the case of 64 degrees of freedom, the increase in time complexity is only 3%. The increase in hardware is mainly two adders and two comparators per CORDIC processor. We believe that the small increase in cost will be amply offset by the improvement in system performance brought about by this error detection.

  5. Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition

    DOEpatents

    Tomkins, James L.; Camp, William J.

    2007-07-17

    A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

  6. Implementation of context independent code on a new array processor: The Super-65

    NASA Technical Reports Server (NTRS)

    Colbert, R. O.; Bowhill, S. A.

    1981-01-01

    The feasibility of rewriting standard uniprocessor programs into code which contains no context-dependent branches is explored. Context independent code (CIC) would contain no branches that might require different processing elements to branch different ways. In order to investigate the possibilities and restrictions of CIC, several programs were recoded into CIC and a four-element array processor was built. This processor (the Super-65) consisted of three 6502 microprocessors and the Apple II microcomputer. The results obtained were somewhat dependent upon the specific architecture of the Super-65 but within bounds, the throughput of the array processor was found to increase linearly with the number of processing elements (PEs). The slope of throughput versus PEs is highly dependent on the program and varied from 0.33 to 1.00 for the sample programs.
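
    The idea of context independent code described above can be illustrated with a branch-free select. This is a sketch in Python rather than 6502 code, and the names `select` and `clamp_lockstep` are hypothetical. The point is only that every processing element runs the same instruction sequence regardless of its data.

```python
def select(cond, a, b):
    """Return a when cond is 1 and b when cond is 0, using arithmetic only."""
    return cond * a + (1 - cond) * b

def clamp_lockstep(values, limit):
    """Clamp each value to `limit`; each loop pass stands in for one processing element."""
    out = []
    for v in values:
        over = int(v > limit)  # comparison result as 0 or 1, not as a branch
        # Both candidate results are available and the flag picks one,
        # so no element needs to take a different path from its neighbours.
        out.append(select(over, limit, v))
    return out
```

    For example, `clamp_lockstep([1, 5, 9], 4)` yields `[1, 4, 4]` without any data-dependent branch inside the per-element work.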

  7. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Gibson, Garth Alan

    1990-01-01

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays match the cost, volume, and capacity of current disk subsystems while, by leveraging parallelism, delivering many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
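
    The claim that simple parity corrects a single, self-identifying disk failure can be sketched with XOR. This is an illustration under simplifying assumptions (equal-length blocks held in memory, one parity block, a known failed position); the function names are hypothetical and do not come from the thesis.

```python
from functools import reduce

def parity_block(blocks):
    """XOR equal-length byte blocks column by column to form a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity block."""
    # XOR of all data blocks equals the parity, so XOR-ing the parity with
    # every surviving block leaves exactly the missing block.
    return parity_block(list(surviving_blocks) + [parity])
```

    The failure must be self-identifying: the array has to know which disk is missing, because parity alone locates nothing and only fills in a known gap.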

  8. Software development on the High-Speed Systolic Array Processor (HISSAP): Lessons learned. Final report, Mar 88-Mar 91

    SciTech Connect

    Tirpak, F.M.

    1991-06-01

    This report documents the lessons learned in programming the Naval Ocean System Center's (NOSC's) High-Speed Systolic Array Processor (HISSAP) testbed. The procedures used for code generation, along with the programming utilities provided in the software development environment, are discussed with regard to their impact on the efficient implementation of algorithms on a parallel processing system such as HISSAP. This information is intended for considerations pertaining to software-development environments in future Navy parallel processing systems. Many of HISSAP's software-development utilities played key roles in the implementation of two computationally intensive algorithms: the Multiple-Signal Classification algorithm (MUSIC) and a four-channel, narrowband, finite-impulse response (FIR) filter. The introduction of utilities not included with the HISSAP tools would undoubtedly have increased the speed and efficiency of software development.

  9. Parallel algorithms for arbitrary dimensional Euclidean distance transforms with applications on arrays with reconfigurable optical buses.

    PubMed

    Wang, Yuh-Rau; Horng, Shi-Jinn

    2004-02-01

    In this paper, we present algorithms for computing the Euclidean distance transform (EDT) of a binary image on the array with reconfigurable optical buses (AROB). First, we develop a parallel algorithm termed Algorithm Expander, which can be implemented in O(1) time on an AROB with N x N^delta processors, where delta = 1/k and k is a constant positive integer. Algorithm Expander is designed to compute a higher dimensional EDT based on the computed lower dimensional EDT. It functions as a general EDT expander that extends an EDT from a lower dimension to a higher dimension. We then develop parallel algorithms for the two-dimensional EDT (2-D_EDT) of a binary image array of size N x N in O(1) time on an AROB with N x N x N^delta processors and for the three-dimensional EDT (3-D_EDT) of a binary image of size N x N x N in O(1) time on an AROB with N x N x N x N^delta processors. To the best of our knowledge, all results derived above are the best O(1) time algorithms known. We then extend it to compute the nD_EDT of a binary image of size N^n in O(n) time on an AROB with N^(n+delta) processors. We also apply our parallel EDT algorithms to build Voronoi diagrams and Voronoi polyhedra (polygons), to find all maximal empty spheres and the largest empty sphere, and to compute the medial axis transform. All of these applications can be solved in the same time complexity on an AROB with the same number of processors as needed for solving the EDT problems in the same dimensions. PMID:15369089

  10. Parallelization of the Ensemble Empirical Model Decomposition (PEEMD) Method on Multi- and Many-core Processors

    NASA Astrophysics Data System (ADS)

    Cheung, S.; Shen, B.; Li, J. F.; Mehrotra, P.

    2013-12-01

    Cheung, S. (1); Shen, B.-W. (2); Mehrotra, P. (1); Li, J.-L. F. (3). Affiliations: (1) NASA Ames Research Center, (2) UMCP/ESSIC, (3) CalTech/JPL. The trend in high performance computing systems is towards clusters of multi-core nodes; from an 8 cores/node Intel Xeon Harpertown processor in 2008 to the latest Intel Xeon Ivy Bridge processor with 24 cores/node. In addition, hardware vendors are developing many-core coprocessors, such as NVIDIA's General Purpose Graphics Processing Unit (GPGPU) and Intel's Xeon Phi, in order to get around the constraints of power and frequency. The hybrid nature of such systems presents a major challenge for software developers in achieving the desired performance. Applications need to be constructed with multiple levels of parallelization along with hybrid communication regimes in order to exploit the power of such systems. The Ensemble Empirical Model Decomposition (EEMD) method has been applied to signal processing on nonlinear and non-stationary data. Due to the ensemble nature of the algorithm and the geographical decomposition of the problem, we have developed a parallel version of the EEMD method with 4-level parallelization, from the grid decomposition level, to time-series level and to the ensemble level using MPI and OpenMP. The parallel EEMD (PEEMD) is being used to analyze Hurricane Sandy (2012) for better understanding of the multiple scale processes that may have impacted Sandy's movement, intensification and formation. In this presentation, we summarize our experiences with the implementation of the PEEMD focusing on the programmability and usability of different processors and accelerators for multiscale analysis for Hurricane Sandy.

  11. Parallel algorithms for computational geometry utilizing a fixed number of processors

    SciTech Connect

    Strader, R.G.

    1988-01-01

    The design of algorithms for systems where both communication and computation are important is presented. Approaches to parallel computation and the underlying theoretical models are surveyed. Two models of computation are developed, both based on a divide-and-conquer strategy. The first utilizes a tree-like merge resulting in several levels of communication and computation, the total number determined by the number of processors. The second model contains a fixed number of levels independent of the number of processors. Using the notation from the survey and the models of computation, algorithms are designed for the computational geometry problems of finding the convex hull and Delaunay triangulation for a set of uniform random points in the Euclidean plane. Communication and computation timing measurements based on these algorithms are presented and analyzed. The results are then generalized to predict the behavior of expanded problems. Architectural support, partitioning issues, and limitations of this approach are discussed.

  12. Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies

    NASA Astrophysics Data System (ADS)

    Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio

    2011-11-01

    Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of the RX in multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability in different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristics (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.

  13. Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

    SciTech Connect

    Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

    2010-01-01

    An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, which delivers over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

  14. Series-parallel method of direct solar array regulation

    NASA Technical Reports Server (NTRS)

    Gooder, S. T.

    1976-01-01

    A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within ±0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.

  15. Direct methods for banded linear systems on massively parallel processor computers

    SciTech Connect

    Arbenz, P.; Gander, W.

    1995-12-01

    The authors discuss direct methods for solving systems of linear equations Ax = b, A ∈ R^(n x n), on massively parallel processor (MPP) computers. Here, A is a real banded n x n matrix with lower and upper half-bandwidth r and s, respectively. We assume that the matrix A has a narrow band, meaning r + s << n. Only in this case is it worthwhile to take into account the zero structure of A, i.e. to store the matrix by diagonals and modify algorithms.
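
    Storing a banded matrix by diagonals, as mentioned above, can be sketched as follows. The layout (row d of the band array holds the diagonal with offset d - r) and the function names are assumptions for illustration. The sketch shows only the storage scheme and a matrix-vector product, not the direct solvers discussed in the report.

```python
def to_band(A, r, s):
    """Pack a dense n x n matrix with half-bandwidths r (lower) and s (upper) by diagonals."""
    n = len(A)
    band = [[0.0] * n for _ in range(r + s + 1)]
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + s + 1)):
            # Entry A[i][j] lies on the diagonal with offset j - i.
            band[j - i + r][i] = A[i][j]
    return band

def band_matvec(band, r, s, x):
    """Compute y = A x touching only the r + s + 1 stored diagonals."""
    n = len(x)
    y = [0.0] * n
    for d in range(r + s + 1):
        offset = d - r
        # Restrict i so that the column index i + offset stays inside 0..n-1.
        for i in range(max(0, -offset), min(n, n - offset)):
            y[i] += band[d][i] * x[i + offset]
    return y
```

    The packed form needs (r + s + 1) * n numbers instead of n * n, which is the saving that makes the narrow-band assumption r + s << n worthwhile.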

  16. Estimating water flow through a hillslope using the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Devaney, Judy E.; Camillo, P. J.; Gurney, R. J.

    1988-01-01

    A new two-dimensional model of water flow in a hillslope has been implemented on the Massively Parallel Processor at the Goddard Space Flight Center. Flow in the soil both in the saturated and unsaturated zones, evaporation and overland flow are all modelled, and the rainfall rates are allowed to vary spatially. Previous models of this type had always been very limited computationally. This model takes less than a minute to model all the components of the hillslope water flow for a day. The model can now be used in sensitivity studies to specify which measurements should be taken and how accurate they should be to describe such flows for environmental studies.

  17. Stochastic simulation of charged particle transport on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Earl, James A.

    1988-01-01

    Computations of cosmic-ray transport based upon finite-difference methods are afflicted by instabilities, inaccuracies, and artifacts. To avoid these problems, researchers developed a Monte Carlo formulation which is closely related not only to the finite-difference formulation, but also to the underlying physics of transport phenomena. Implementations of this approach are currently running on the Massively Parallel Processor at Goddard Space Flight Center, whose enormous computing power overcomes the poor statistical accuracy that usually limits the use of stochastic methods. These simulations have progressed to a stage where they provide a useful and realistic picture of solar energetic particle propagation in interplanetary space.

  18. Block iterative restoration of astronomical images with the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Heap, Sara R.; Lindler, Don J.

    1987-01-01

    A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4,900 linear systems, each of size 121 x 121. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.

  19. A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors

    PubMed Central

    Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

    2009-01-01

    A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array, with row-parallel 6-bit Algorithmic ADCs, row-parallel gray-scale image processors, pixel-parallel SIMD Processing Element (PE) array, and instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixels resolution and 6-bit gray-scale image is fabricated in 0.18 μm Standard CMOS process. The chip area is 1.5 mm × 3.5 mm. Each pixel measures 9.5 μm × 9.5 μm and each processing element measures 23 μm × 29 μm. The experimental results demonstrate that the chip can perform low-level and mid-level image processing and that it can be applied in real-time vision applications, such as high speed target tracking. PMID:22454565

  20. The Square Kilometre Array Science Data Processor. Preliminary compute platform design

    NASA Astrophysics Data System (ADS)

    Broekema, P. C.; van Nieuwpoort, R. V.; Bal, H. E.

    2015-07-01

    The Square Kilometre Array is a next-generation radio-telescope, to be built in South Africa and Western Australia. It is currently in its detailed design phase, with procurement and construction scheduled to start in 2017. The SKA Science Data Processor is the high-performance computing element of the instrument, responsible for producing science-ready data. This is a major IT project, with the Science Data Processor expected to challenge the computing state of the art even in 2020. In this paper we introduce the preliminary Science Data Processor design and the principles that guide the design process, as well as the constraints on the design. We introduce a highly scalable and flexible system architecture capable of handling the SDP workload.

  1. Animated computer graphics models of space and earth sciences data generated via the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

    1987-01-01

    A capability was developed for rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

  2. Parallel scheduling of recursively defined arrays

    NASA Technical Reports Server (NTRS)

    Myers, T. J.; Gokhale, M. B.

    1986-01-01

    A new method of automatic generation of concurrent programs which constructs arrays defined by sets of recursive equations is described. It is assumed that the time of computation of an array element is a linear combination of its indices, and integer programming is used to seek a succession of hyperplanes along which array elements can be computed concurrently. The method can be used to schedule equations involving variable length dependency vectors and mutually recursive arrays. Portions of the work reported here have been implemented in the PS automatic program generation system.
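
    The hyperplane idea above can be made concrete with a small recurrence. In this sketch the array is defined by a[i][j] = a[i-1][j] + a[i][j-1] with boundary values 1, and the schedule t = i + j is chosen by hand rather than found by integer programming as in the paper. The function names are hypothetical.

```python
def wavefront_schedule(n, m):
    """Group index pairs (i, j) by the hyperplane i + j = t on which they are computed."""
    steps = [[] for _ in range(n + m - 1)]
    for i in range(n):
        for j in range(m):
            steps[i + j].append((i, j))
    return steps

def evaluate(n, m):
    """Evaluate the recurrence front by front; elements within a front are independent."""
    a = [[1] * m for _ in range(n)]  # boundary row and column stay at 1
    for front in wavefront_schedule(n, m):
        for i, j in front:
            if i > 0 and j > 0:
                # Both operands lie on the previous hyperplane (i + j - 1),
                # so they were already computed in the preceding step.
                a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

    The inner loop over a front could run concurrently, since no element on a hyperplane depends on another element of the same hyperplane.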

  3. Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2011-04-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

  4. Parallel collective resonances in arrays of gold nanorods.

    PubMed

    Vitrey, Alan; Aigouy, Lionel; Prieto, Patricia; García-Martín, José Miguel; González, María U

    2014-01-01

    In this work we discuss the excitation of parallel collective resonances in arrays of gold nanoparticles. Parallel collective resonances result from the coupling of the nanoparticles' localized surface plasmons with diffraction orders traveling in the direction parallel to the polarization vector. While they provide field enhancement and delocalization, as standard collective resonances do, our results suggest that parallel resonances could exhibit greater tolerance to index asymmetry in the environment surrounding the arrays. The near- and far-field properties of these resonances are analyzed, both experimentally and numerically. PMID:24645987

  5. Application of an array processor to the analysis of magnetic data for the Doublet III tokamak

    SciTech Connect

    Wang, T.S.; Saito, M.T.

    1980-08-01

    Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed.

  6. Run-time recognition of task parallelism within the P++ parallel array class library

    SciTech Connect

    Parsons, R.; Quinlan, D.

    1993-11-01

    This paper explores the use of a run-time system to recognize task parallelism with a C++ array class library. Run-time systems currently support data parallelism in P++, FORTRAN 90 D, and High Performance FORTRAN. But data parallelism is insufficient for many applications, including adaptive mesh refinement. Without access to both data and task parallelism such applications exhibit several orders of magnitude more message passing and poor performance. In this work, a C++ array class library is used to implement deferred evaluation and run-time dependence analysis for task parallelism recognition, to obtain task parallelism through a data flow interpretation of data parallel array statements. Performance results show that the analysis and optimizations are both efficient and practical, allowing us to consider more substantial optimizations.

  7. Realization of a neuronal hardware with digital signal processor and programmable gate arrays

    NASA Astrophysics Data System (ADS)

    Meyer-Baese, Anke; Meyer-Baese, Uwe; Scheich, Henning

    1995-04-01

    In this paper we describe how the processing speed of a radial basis neural network can be improved by the use of field programmable gate arrays (FPGA). The calculation of the very time-consuming exponential function is handled by an optimized CORDIC processor. We determine the number of FPGAs required and compare the processing speed of FPGA and DSP for an application in speech recognition.

  8. High-performance FFT implementation on the BOPS ManArray parallel DSP

    NASA Astrophysics Data System (ADS)

    Pitsianis, Nikos P.; Pechanek, Gerald

    1999-11-01

    We present a high performance implementation of the FFT algorithm on the BOPS ManArray parallel DSP processor. The ManArray we consider for this application consists of an array controller and 2 to 4 fully interconnected processing elements. To expose the parallelism inherent to an FFT algorithm we use a factorization of the DFT matrix in Kronecker products, permutation and diagonal matrices. Our implementation utilizes the multiple levels of parallelism that are available on the ManArray. We use the special multiply complex instruction, which calculates the product of two complex 32-bit fixed point numbers in 2 cycles (pipelinable). Instruction level parallelism is exploited via the indirect Very Long Instruction Word (iVLIW). With an iVLIW, in the same cycle a complex number is read from memory, another complex number is written to memory, a complex multiplication starts and another finishes, two complex additions or subtractions are done and a complex number is exchanged with another processing element. Multiple local FFTs are executed in Single Instruction Multiple Data (SIMD) mode, and to avoid a costly data transposition we execute distributed FFTs in Synchronous Multiple Instructions Multiple Data (SMIMD) mode.
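
    The abstract does not give the factorization itself. As a hedged illustration, the sketch below checks the textbook radix-2 Cooley-Tukey form, F_n = (F_2 ⊗ I_m) · T · (I_2 ⊗ F_m) · P with m = n/2, in plain Python. The function names are assumptions, and this is not the ManArray code. The factor I_2 ⊗ F_m is the part that exposes parallelism: it is two independent half-size DFTs.

```python
import cmath


def dft_matrix(n):
    """Dense n x n DFT matrix with entries w^(r*c), w = exp(-2*pi*i/n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (r * c) for c in range(n)] for r in range(n)]


def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def kron(a, b):
    """Kronecker product of two dense matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def radix2_factors(n):
    """Return (butterfly, twiddle, local_dfts, permutation); their product is F_n."""
    m = n // 2
    w = cmath.exp(-2j * cmath.pi / n)
    butterfly = kron(dft_matrix(2), identity(m))         # F_2 (x) I_m
    diag = [1.0] * m + [w ** k for k in range(m)]        # twiddle factors
    twiddle = [[diag[r] if r == c else 0.0 for c in range(n)] for r in range(n)]
    local_dfts = kron(identity(2), dft_matrix(m))        # I_2 (x) F_m
    order = list(range(0, n, 2)) + list(range(1, n, 2))  # even-odd sort
    permutation = [[1.0 if c == order[r] else 0.0 for c in range(n)]
                   for r in range(n)]
    return butterfly, twiddle, local_dfts, permutation


def factorization_error(n):
    """Largest entrywise deviation between the factor product and F_n."""
    product = identity(n)
    for factor in radix2_factors(n):
        product = matmul(product, factor)
    reference = dft_matrix(n)
    return max(abs(p - f)
               for prow, frow in zip(product, reference)
               for p, f in zip(prow, frow))


print(f"max deviation for n = 8: {factorization_error(8):.2e}")
```

    Applying the same identity recursively to F_m yields the full FFT; how the stages are assigned to processing elements is a design decision the abstract only outlines.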

  9. Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Dimpsey, Robert Tod

    1992-01-01

    In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, there are few evaluation efforts addressing this issue.

  10. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    NASA Astrophysics Data System (ADS)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.

  11. High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

    DOEpatents

    Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.

    2002-01-01

    As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

  12. Evaluation of the Intel iWarp parallel processor for space flight applications

    NASA Technical Reports Server (NTRS)

    Hine, Butler P., III; Fong, Terrence W.

    1993-01-01

    The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.

  13. On-board landmark navigation and attitude reference parallel processor system

    NASA Technical Reports Server (NTRS)

    Gilbert, L. E.; Mahajan, D. T.

    1978-01-01

    An approach to autonomous navigation and attitude reference for earth observing spacecraft is described along with the landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine if better than one pixel accuracy in registration can be achieved consistent with onboard processor timing and capacity constraints are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and chip library. The data is processed in parallel stages, effectively reducing the time to match the small known image within a larger image as seen by the onboard image system. Shared memory is incorporated in the system to help communicate intermediate results among microprocessors. The functions include finding mean values and summation of absolute differences over the image search area. The hardware is a low power, compact unit suitable to onboard application with the flexibility to provide for different parameters depending upon the environment.
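
    The core of a sequential similarity detection algorithm is a sum of absolute differences that is abandoned as soon as it exceeds a threshold. The sketch below is a minimal Python illustration and not the flight implementation: the function name, the toy image, and the choice to tighten the threshold to the best error found so far are assumptions. The mean-value step mentioned in the abstract is omitted.

```python
def ssda_match(image, template, threshold):
    """Return (row, col, error) of the best window, or None if none beats threshold."""
    th, tw = len(template), len(template[0])
    best = None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            # Assumed variant: once a match exists, later windows must beat it.
            limit = threshold if best is None else best[2]
            error = 0
            abandoned = False
            for dr in range(th):
                for dc in range(tw):
                    error += abs(image[r + dr][c + dc] - template[dr][dc])
                    if error >= limit:
                        abandoned = True   # stop accumulating for this window
                        break
                if abandoned:
                    break
            if not abandoned:
                best = (r, c, error)
    return best


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 8, 7, 6],
    [5, 4, 3, 2],
]
template = [
    [7, 8],
    [7, 6],
]
match = ssda_match(image, template, threshold=10)
print(match)
```

    Most windows are rejected after only a few pixel comparisons, which is where the time saving comes from. The windows are independent of each other, so they can be split across processors.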

  14. Parallel Spectral Acquisition with an Ion Cyclotron Resonance Cell Array.

    PubMed

    Park, Sung-Gun; Anderson, Gordon A; Navare, Arti T; Bruce, James E

    2016-01-19

    Mass measurement accuracy is a critical analytical figure-of-merit in most areas of mass spectrometry application. However, the time required for acquisition of high-resolution, high mass accuracy data limits many applications and is an aspect under continual pressure for development. Current efforts target implementation of higher electrostatic and magnetic fields because ion oscillatory frequencies increase linearly with field strength. As such, the time required for spectral acquisition of a given resolving power and mass accuracy decreases linearly with increasing fields. Mass spectrometer developments to include multiple high-resolution detectors that can be operated in parallel could further decrease the acquisition time by a factor of n, the number of detectors. Efforts described here resulted in development of an instrument with a set of Fourier transform ion cyclotron resonance (ICR) cells as detectors that constitute the first MS array capable of parallel high-resolution spectral acquisition. ICR cell array systems consisting of three or five cells were constructed with printed circuit boards and installed within a single superconducting magnet and vacuum system. Independent ion populations were injected and trapped within each cell in the array. Upon filling the array, all ions in all cells were simultaneously excited and ICR signals from each cell were independently amplified and recorded in parallel. Presented here are the initial results of successful parallel spectral acquisition, parallel mass spectrometry (MS) and MS/MS measurements, and parallel high-resolution acquisition with the MS array system. PMID:26669509

  15. A processor-time-minimal systolic array for cubical mesh algorithms

    SciTech Connect

    Cappello, P. (Dept. of Computer Science)

    1992-01-01

    Using a directed acyclic graph (dag) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's dag first is illustrated using a triangular shaped 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n × n × n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n - 2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n²/4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n²/4⌉ processing elements: it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically-connected, 2-D directed mesh.
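
    Both figures can be checked by brute force for small n. The Python sketch below assumes the usual cubical mesh, in which node (i, j, k) has arcs to (i+1, j, k), (i, j+1, k), and (i, j, k+1); the function name is hypothetical. Every node then lies on a longest path, so a schedule of 3n - 2 steps must run node (i, j, k) at step i + j + k + 1, and the most populated step gives a lower bound on the processor count.

```python
from collections import Counter
from itertools import product


def mesh_bounds(n):
    """Return (steps, processors) for the n x n x n directed mesh."""
    # depth[s] counts the nodes whose longest path from (0, 0, 0) has s arcs.
    depth = Counter(i + j + k for i, j, k in product(range(n), repeat=3))
    steps = max(depth) + 1            # nodes on the longest path
    processors = max(depth.values())  # widest set of simultaneous nodes
    return steps, processors


for n in range(1, 7):
    steps, processors = mesh_bounds(n)
    ceiling = (3 * n * n + 3) // 4    # ceil(3 n^2 / 4) in integer arithmetic
    print(n, steps, processors, ceiling)
```

    For the sizes tried here the widest step equals the ceiling of 3n²/4, the bound stated in the abstract when its expression is read as a ceiling. The sketch does not construct the systolic array that attains it.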

  16. Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications

    NASA Technical Reports Server (NTRS)

    Bruno, John

    1984-01-01

    The results of an investigation into the feasibility of using the MPP for direct and large eddy simulations of the Navier-Stokes equations are presented. A major part of this study was devoted to the implementation of two of the standard numerical algorithms for CFD. These implementations were not run on the Massively Parallel Processor (MPP) since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed, and from it were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered; namely, adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This, combined with the prospect of building a faster and larger MPP-like machine, holds the promise of achieving the sustained gigaflop rates that are required for the numerical simulations in CFD.

  17. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGESBeta

    O'Keefe, Matthew; Parr, Terence; Edgar, B. Kevin; Anderson, Steve; Woodward, Paul; Dietz, Hank

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  18. Developments of 60 GHz antenna and wireless interconnect inside multi-chip module for parallel processor system

    NASA Astrophysics Data System (ADS)

    Yeh, Ho-Hsin

    In order to carry out the complicated computation inside high performance computing (HPC) systems, tens to hundreds of parallel processor chips and physical wires must be integrated inside the multi-chip package module (MCM). The physical wires that serve as the electrical interconnects between the processor chips, however, pose placement and routing challenges because of the unequal progress between semiconductor and I/O size reductions. The primary goal of the research is to overcome these package design challenges by providing a hybrid computing architecture with implemented 60 GHz antennas as a highly efficient wireless interconnect that could deliver over 10 Gbps bandwidth for data transmission. The dissertation is divided into three major parts. In the first part, two different performance metrics for evaluating the antenna's system performance within the chip-to-chip wireless interconnect, power loss required to be recovered (PRE) and wireless link budget, are introduced to address the design challenges and define the design goals. The second part contains the design concept, fabrication procedure and measurements of an implemented 60 GHz broadband antenna for multi-chip data transmission. The developed antenna utilizes the periodically-patched artificial magnetic conductor (AMC) structure associated with the ground-shielded conductor in order to enhance the antenna's impedance matching bandwidth. The validation shows that the design concept can achieve over 10 GHz of -10 dB S11 bandwidth, which indicates the antenna's operating bandwidth, together with the horizontal data transmission capability required by a planar-type chip-to-chip interconnect. In order to reduce both PRE and wireless link budget numbers, a 60 GHz two-element array for multi-chip communication is developed in the third part. The third section includes the combined-field analysis, the design concepts on two-element array and feeding

  19. Parallel arrays of Josephson junctions for submillimeter local oscillators

    NASA Technical Reports Server (NTRS)

    Pance, Aleksandar; Wengler, Michael J.

    1992-01-01

    In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

  20. Effects of rotation on turbulent convection: Direct numerical simulation using parallel processors

    NASA Astrophysics Data System (ADS)

    Chan, Daniel Chiu-Leung

    A new parallel implicit adaptive mesh refinement (AMR) algorithm is developed for the prediction of unsteady behaviour of laminar flames. The scheme is applied to the solution of the system of partial-differential equations governing time-dependent, two- and three-dimensional, compressible laminar flows for reactive thermally perfect gaseous mixtures. A high-resolution finite-volume spatial discretization procedure is used to solve the conservation form of these equations on body-fitted multi-block hexahedral meshes. A local preconditioning technique is used to remove numerical stiffness and maintain solution accuracy for low-Mach-number, nearly incompressible flows. A flexible block-based octree data structure has been developed and is used to facilitate automatic solution-directed mesh adaptation according to physics-based refinement criteria. The data structure also enables an efficient and scalable parallel implementation via domain decomposition. The parallel implicit formulation makes use of a dual-time-stepping like approach with an implicit second-order backward discretization of the physical time, in which a Jacobian-free inexact Newton method with a preconditioned generalized minimal residual (GMRES) algorithm is used to solve the system of nonlinear algebraic equations arising from the temporal and spatial discretization procedures. An additive Schwarz global preconditioner is used in conjunction with block incomplete LU type local preconditioners for each sub-domain. The Schwarz preconditioning and block-based data structure readily allow efficient and scalable parallel implementations of the implicit AMR approach on distributed-memory multi-processor architectures. The scheme was applied to solutions of steady and unsteady laminar diffusion and premixed methane-air combustion and was found to accurately predict key flame characteristics. For a premixed flame under terrestrial gravity, the scheme accurately predicted the frequency of the natural

  1. Parallel optical readout of cantilever arrays in dynamic mode.

    PubMed

    Koelmans, W W; van Honschoten, J; de Vries, J; Vettiger, P; Abelmann, L; Elwenspoek, M C

    2010-10-01

    Parallel frequency readout of an array of cantilevers is demonstrated using optical beam deflection with a single laser-diode pair. Multi-frequency addressing makes the individual nanomechanical response of each cantilever distinguishable within the received signal. Addressing is accomplished by exciting the array with the sum of all cantilever resonant frequencies. This technique requires considerably less hardware compared to other parallel optical readout techniques. Readout is demonstrated in beam deflection mode and interference mode. Many cantilevers can be read out in parallel, limited by the oscillators' quality factor and available bandwidth. The proposed technique facilitates parallelism in applications at the nano-scale, including probe-based data storage and biological sensing. PMID:20820095
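
    The idea of separating cantilevers by frequency within one received signal can be sketched with a simple lock-in style demodulation in Python. All numbers and names below are hypothetical, and each cantilever is idealized as a pure sinusoid at its resonant frequency; a real resonator has a finite quality factor, so responses overlap when the frequencies are too close.

```python
import math


def lock_in_amplitude(signal, freq, sample_rate):
    """Amplitude of the component at `freq`, by correlating with cosine and sine."""
    n = len(signal)
    i_sum = sum(s * math.cos(2 * math.pi * freq * t / sample_rate)
                for t, s in enumerate(signal))
    q_sum = sum(s * math.sin(2 * math.pi * freq * t / sample_rate)
                for t, s in enumerate(signal))
    return 2 * math.hypot(i_sum, q_sum) / n


sample_rate = 10_000                                        # Hz (assumed)
n_samples = 1_000                                           # 0.1 s record
resonances = {"c1": 500.0, "c2": 700.0, "c3": 1100.0}       # Hz (assumed)
responses = {"c1": 1.0, "c2": 0.4, "c3": 0.75}              # arbitrary units

# One detector sees the sum of all cantilever responses.
signal = [
    sum(responses[name] * math.sin(2 * math.pi * f * t / sample_rate)
        for name, f in resonances.items())
    for t in range(n_samples)
]

recovered = {name: lock_in_amplitude(signal, f, sample_rate)
             for name, f in resonances.items()}
for name, amplitude in recovered.items():
    print(name, round(amplitude, 6))
```

    The record length is chosen so that every frequency completes a whole number of periods, which makes the components orthogonal and the recovered amplitudes match the inputs up to rounding error.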

  2. Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor

    NASA Astrophysics Data System (ADS)

    Kumaki, Takeshi; Ishizaki, Masakatsu; Koide, Tetsushi; Mattausch, Hans Jürgen; Kuroda, Yasuto; Gyohten, Takayuki; Noda, Hideyuki; Dosaka, Katsumi; Arimoto, Kazutami; Saito, Kazunori

    This paper presents an integration architecture of content addressable memory (CAM) and a massive-parallel memory-embedded SIMD matrix for constructing a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be a better way of processing the repeated arithmetic operation types in multimedia applications. The architecture proposed in this paper additionally exploits CAM technology and therefore enables fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed architecture can realize efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the frequently used JPEG image-compression application show that the necessary clock cycle number can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performances in Mpixel/mm² are factors of 3.3 and 4.4 better than with a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP, respectively.

  3. Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor

    NASA Technical Reports Server (NTRS)

    Field, E. I.

    1975-01-01

    The ILLIAC IV, a fourth generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics of the ILLIAC and of the ARPANET, a telecommunications network which spans the continent and makes the ILLIAC accessible to nearly all major industrial centers in the United States, are summarized. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task, with corresponding manpower estimates and schedules.

  4. NOSC (Naval Ocean Systems Center) advanced systolic array processor (ASAP). Professional paper for period ending August 1987

    SciTech Connect

    Loughlin, J.P.

    1987-12-01

    Design of a high-speed (250 million 32-bit floating-point operations per second) two-dimensional systolic array composed of 16-bit/slice microsequencer structured processors is presented. System-design features such as broadcast data flow, tag bit movement, and integrated diagnostic test registers are described. The software development tools needed to map complex matrix-based signal-processing algorithms onto the systolic-processor system are described.

  5. Constructing higher order DNA origami arrays using DNA junctions of anti-parallel/parallel double crossovers

    NASA Astrophysics Data System (ADS)

    Ma, Zhipeng; Park, Seongsu; Yamashita, Naoki; Kawai, Kentaro; Hirai, Yoshikazu; Tsuchiya, Toshiyuki; Tabata, Osamu

    2016-06-01

    DNA origami provides a versatile method for the construction of nanostructures with defined shape, size and other properties; such nanostructures may enable a hierarchical assembly of large scale architecture for the placement of other nanomaterials with atomic precision. However, the effective use of these higher order structures as functional components depends on knowledge of their assembly behavior and mechanical properties. This paper demonstrates construction of higher order DNA origami arrays with controlled orientations based on the formation of two types of DNA junctions: anti-parallel and parallel double crossovers. A two-step assembly process, in which preformed rectangular DNA origami monomer structures themselves undergo further self-assembly to form numerically unlimited arrays, was investigated to reveal the influences of assembly parameters. AFM observations showed that when parallel double crossover DNA junctions are used, the assembly of DNA origami arrays occurs with fewer monomers than for structures formed using anti-parallel double crossovers, given the same assembly parameters, indicating that the configuration of parallel double crossovers is not energetically preferred. However, the direct measurement by AFM force-controlled mapping shows that both DNA junctions of anti-parallel and parallel double crossovers have homogeneous mechanical stability with any part of DNA origami.

  6. Mobile and replicated alignment of arrays in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  7. Parallel nanoimaging and nanolithography using a heated microcantilever array

    NASA Astrophysics Data System (ADS)

    Somnath, Suhas; Kim, Hoe Joon; Hu, Huan; King, William P.

    2014-01-01

    We report parallel topographic imaging and nanolithography using heated microcantilever arrays integrated into a commercial atomic force microscope (AFM). The array has five AFM cantilevers, each of which has an internal resistive heater. The temperatures of the cantilever heaters can be monitored and controlled independently and in parallel. We perform parallel AFM imaging of a region of size 550 μm × 90 μm, where the cantilever heat flow signals provide a measure of the nanometer-scale substrate topography. At a cantilever scan speed of 1134 μm s⁻¹, we acquire a 3.1 million-pixel image in 62 s with noise-limited vertical resolution of 0.6 nm and pixels of size 351 nm × 45 nm. At a scan speed of 4030 μm s⁻¹ we acquire a 26.4 million-pixel image in 124 s with vertical resolution of 5.4 nm and pixels of size 44 nm × 43 nm. Finally, we demonstrate parallel nanolithography with the cantilever array, including iterations of measure-write-measure nanofabrication, with each cantilever operating independently.

  8. Generalized schemes for access and alignment of data in parallel processors with self-routing interconnection networks

    SciTech Connect

    Boppana, R.V.; Raghavendra, C.S.

    1991-02-01

    In this paper the authors give a generalized solution to the problem of conflict-free access of various templates of data of a matrix, when they are stored in memory units in a parallel processor. The important features of the method are: compact representation of a skewing scheme, simple address computation, use of self-routing schemes to set up the interconnection network, and a general framework for the study of skewing schemes. In the method, each template access of interest will be a linear permutation on the processor address. The linear permutation involved determines the types of templates accessible. For parallel access of the most important templates, namely, row, column, main diagonal, and square blocks, the interconnection network needs to realize only the class of linear-complement permutations. It is known that with Benes or Omega as the interconnection network, one can efficiently self-route these permutations; this compares favorably with the schemes proposed by other researchers who assume that a crossbar is available for processor-memory interconnections. Hence, the approach given in the paper can be used to solve the data alignment problem for the existing parallel machines such as IBM RP3, Cedar multiprocessor, and NYU Ultracomputer. This is a generalized solution to the data skewing problem and encompasses the previous efforts by other researchers as special cases.
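
    A small Python example may help make "conflict-free access" concrete. It is an illustrative linear skew for a 4 × 4 matrix over 4 memory modules, chosen for this sketch and not taken from the paper: element (i, j) is stored in module i XOR x·j, where x·j is multiplication by x in GF(4). The names times_x, module, and conflict_free are hypothetical.

```python
N = 4  # matrix dimension and number of memory modules (small illustrative size)


def times_x(j):
    """Multiply the 2-bit value j by x in GF(4), modulus x^2 + x + 1."""
    j1, j0 = (j >> 1) & 1, j & 1
    return ((j1 ^ j0) << 1) | j1


def module(i, j):
    """Memory module holding element (i, j) under this assumed linear skew."""
    return i ^ times_x(j)


def conflict_free(cells):
    """True if no two cells of the template fall in the same module."""
    modules = [module(i, j) for i, j in cells]
    return len(set(modules)) == len(modules)


templates = {
    "row 2": [(2, j) for j in range(N)],
    "column 1": [(i, 1) for i in range(N)],
    "main diagonal": [(i, i) for i in range(N)],
    "2x2 block": [(i, j) for i in (2, 3) for j in (0, 1)],
}
for name, cells in templates.items():
    print(name, conflict_free(cells))
```

    With the simpler skew i XOR j the whole main diagonal would fall in module 0. Mixing the column index through an invertible linear map avoids that while keeping the address computation to a few bit operations.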

  9. A Retargetable Compiler Based on Graph Representation for Dynamically Reconfigurable Processor Arrays

    NASA Astrophysics Data System (ADS)

    Tunbunheng, Vasutan; Amano, Hideharu

    To support a design environment for various Dynamically Reconfigurable Processor Arrays (DRPAs), the Graph with Configuration Information (GCI) is proposed to represent the configurable resources of the target dynamically reconfigurable architecture. Functional units, constant units, registers, and routing resources can be represented in the graph along with the configuration information. Hardware restrictions are also captured in the graph by limiting the possible configurations at a node controlled by another node. A prototype compiler called Black-Diamond, which uses the GCI, is now available for three different DRPAs. It translates a data-flow graph from a C-like front-end description, applies placement and routing using the GCI, and generates configuration data for each element of the DRPA. Evaluation results for simple applications show that Black-Diamond can generate reasonable designs for all three architectures. Other target architectures can easily be handled by representing their architectural properties in a GCI.

  10. Solar-pumped Nd:Cr:GSGG parallel array laser

    NASA Astrophysics Data System (ADS)

    Thompson, George A.; Krupkin, V.; Yogev, Amnon; Oron, Moshe

    1992-12-01

    A compact, parallel array of three Nd:Cr:GSGG laser rods is used to construct a quasi-CW laser. The array is pumped by concentrated solar light and is mounted in a single concentrator. The three laser rods use a common pair of laser mirrors to define the optical resonator. The three laser beams are not coherently coupled in these experiments. The simplicity of the design, and its reasonable stability in terms of vibration and optical misalignment, suggest that the design may be scalable for higher power.

  11. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    NASA Technical Reports Server (NTRS)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  12. Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems

    NASA Technical Reports Server (NTRS)

    Das, T. P.

    1988-01-01

    The usefulness of the Massively Parallel Processor (MPP) for investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedure being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of investigations and results are reported, including MPP adaptation of self-consistent charge extended Hueckel (SCCEH) procedure, MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and MPP adaptation of the many-body procedure for atomic systems.

  13. Parallel Processing of Large Scale Microphone Arrays for Sound Capture

    NASA Astrophysics Data System (ADS)

    Jan, Ea-Ee.

    1995-01-01

    Performance of microphone sound pick up is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Besides, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be

  14. Performance evaluation of the HEP, ELXSI and CRAY X-MP parallel processors on hydrocode test problems

    SciTech Connect

    Liebrock, L.M.; McGrath, J.F.; Hicks, D.L.

    1986-07-07

    Parallel programming promises improved processing speeds for hydrocodes, magnetohydrocodes, multiphase flow codes, thermal-hydraulics codes, wavecodes and other continuum dynamics codes. This paper presents the results of some investigations of parallel algorithms on three parallel processors: the CRAY X-MP, ELXSI and the HEP computers. Introduction and Background: We report the results of investigations of parallel algorithms for computational continuum dynamics. These programs (hydrocodes, wavecodes, etc.) produce simulations of the solutions to problems arising in the motion of continua: solid dynamics, liquid dynamics, gas dynamics, plasma dynamics, multiphase flow dynamics, thermal-hydraulic dynamics and multimaterial flow dynamics. This report restricts its scope to one-dimensional algorithms such as the von Neumann-Richtmyer (1950) scheme.

  15. Parallel computation of optimized arrays for 2-D electrical imaging surveys

    NASA Astrophysics Data System (ADS)

    Loke, M. H.; Wilkinson, P. B.; Chambers, J. E.

    2010-12-01

    Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman-Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix-vector multiplications for a single add-on array were replaced with matrix-matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on

  16. Highly scalable parallel processing of extracellular recordings of Multielectrode Arrays.

    PubMed

    Gehring, Tiago V; Vasilaki, Eleni; Giugliano, Michele

    2015-01-01

    Technological advances in Multielectrode Arrays (MEAs) used for multisite, parallel electrophysiological recordings lead to an ever-increasing amount of raw data being generated. Arrays with hundreds to a few thousand electrodes are slowly seeing widespread use, and the expectation is that more sophisticated arrays will become available in the near future. In order to process the large data volumes resulting from MEA recordings, there is a pressing need for new software tools able to process many data channels in parallel. Here we present a new tool for processing MEA data recordings that makes use of new programming paradigms and recent technology developments to unleash the power of modern highly parallel hardware, such as multi-core CPUs with vector instruction sets or GPGPUs. Our tool builds on and complements existing MEA data analysis packages. It shows high scalability and can be used to speed up some performance-critical pre-processing steps such as data filtering and spike detection, helping to make the analysis of larger data sets tractable. PMID:26737215

  17. Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor

    NASA Technical Reports Server (NTRS)

    Moore, J. Strother

    1992-01-01

    Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease, et al work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.
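
    The four-processor agreement scheme can be sketched in software. The following is a minimal simulation of one round of relaying in the Oral Messages algorithm, OM(1), with one commander and three lieutenants; it is not the verified hardware design. The fault model (`lie`), the tie-breaking value `DEFAULT`, and all function names are assumptions made for illustration.

```python
from collections import Counter

DEFAULT = 0  # value adopted when there is no strict majority (an assumption)


def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT


def om1(commander_value, faulty=None, lie=lambda sender, receiver, value: 1 - value):
    """Simulate OM(1): commander 0, lieutenants 1..3, at most one faulty node.

    `lie` is a hypothetical model of a faulty node: it picks what the node
    tells `receiver` in place of the honest `value`.
    """
    lieutenants = [1, 2, 3]

    def send(sender, receiver, value):
        return lie(sender, receiver, value) if sender == faulty else value

    # Round 1: the commander sends its value to every lieutenant.
    received = {lt: send(0, lt, commander_value) for lt in lieutenants}

    # Round 2: each lieutenant relays what it received to the other two,
    # then decides by majority over its own copy and the two relayed copies.
    decisions = {}
    for lt in lieutenants:
        relayed = [send(other, lt, received[other])
                   for other in lieutenants if other != lt]
        decisions[lt] = majority([received[lt]] + relayed)
    return decisions
```

    With a faulty lieutenant, for example `om1(1, faulty=3)`, the two loyal lieutenants still decide on the commander's value; with a faulty commander that sends inconsistent values, the three lieutenants still agree with one another.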

  18. Microchannel cross load array with dense parallel input

    DOEpatents

    Swierkowski, Stefan P.

    2004-04-06

    An architecture or layout for microchannel arrays using T or Cross (+) loading for electrophoresis or other injection and separation chemistry that are performed in microfluidic configurations. This architecture enables a very dense layout of arrays of functionally identical shaped channels and it also solves the problem of simultaneously enabling efficient parallel shapes and biasing of the input wells, waste wells, and bias wells at the input end of the separation columns. One T load architecture uses circular holes with common rows, but not columns, which allows the flow paths for each channel to be identical in shape, using multiple mirror image pieces. Another T load architecture enables the access hole array to be formed on a biaxial, collinear grid suitable for EDM micromachining (square holes), with common rows and columns.

  19. Fast space-filling molecular graphics using dynamic partitioning among parallel processors.

    PubMed

    Gertner, B J; Whitnell, R M; Wilson, K R

    1991-09-01

    We present a novel algorithm for the efficient generation of high-quality space-filling molecular graphics that is particularly appropriate for the creation of the large number of images needed in the animation of molecular dynamics. Each atom of the molecule is represented by a sphere of an appropriate radius, and the image of the sphere is constructed pixel-by-pixel using a generalization of the lighting model proposed by Porter (Comp. Graphics 1978, 12, 282). The edges of the spheres are antialiased, and intersections between spheres are handled through a simple blending algorithm that provides very smooth edges. We have implemented this algorithm on a multiprocessor computer using a procedure that dynamically repartitions the effort among the processors based on the CPU time used by each processor to create the previous image. This dynamic reallocation among processors automatically maximizes efficiency in the face of both the changing nature of the image from frame to frame and the shifting demands of the other programs running simultaneously on the same processors. We present data showing the efficiency of this multiprocessing algorithm as the number of processors is increased. The combination of the graphics and multiprocessor algorithms allows the fast generation of many high-quality images. PMID:1772836

  20. Parallel Syntheses of Peptides on Teflon-Patterned Paper Arrays (SyntArrays).

    PubMed

    Deiss, Frédérique; Yang, Yang; Derda, Ratmir

    2016-01-01

    Screening of peptides to find the ligands that bind to specific targets is an important step in drug discovery. These high-throughput screens require large number of structural variants of peptides to be synthesized and tested. This chapter describes the generation of arrays of peptides on Teflon-patterned sheets of paper. First, the protocol describes the patterning of paper with a Teflon solution to produce arrays with solvophobic barriers that are able to confine organic solvents. Next, we describe the parallel syntheses of 96 peptides on Teflon-patterned arrays using the SPOT synthesis method. PMID:26614081

  1. Enhanced interprocessor communication strategies for parallel TMS320C40 digital signal processor systems

    NASA Astrophysics Data System (ADS)

    Hartley, David A.; Harvey, David M.

    1993-11-01

    The Texas Instruments' TMS320C40 digital signal processor contains communication hardware which enables processors to be connected together to form multiprocessing systems. Analysis of the device's communication channels suggests that it would be beneficial to use additional communication hardware to maximize system performance. The use of mesh routing chips in conjunction with the processors has been investigated. The two devices are interfaced using two TMS320C40 communication channels. Lower message latencies can be achieved by using TMS320C40 communication channels to perform nearest neighbor communications while using the routing chips to perform all other message routing. However, the use of additional TMS320C40 channels can degrade the rate at which packets are injected into and consumed from the network, resulting in underutilization of the network bandwidth.

  2. Scalable Unix commands for parallel processors : a high-performance implementation.

    SciTech Connect

    Ong, E.; Lusk, E.; Gropp, W.

    2001-06-22

    We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.

  3. Parallel excitation with an array of transmit coils.

    PubMed

    Zhu, Yudong

    2004-04-01

    Theoretical and experimental results are presented that establish the value of parallel excitation with a transmit coil array in accelerating excitation and managing RF power deposition. While a 2D or 3D excitation pulse can be used to induce a multidimensional transverse magnetization pattern for a variety of applications (e.g., a 2D localized pattern for accelerating spatial encoding during signal acquisition), it often involves the use of prolonged RF and gradient pulses. Given a parallel system that is composed of multiple transmit coils with corresponding RF pulse synthesizers and amplifiers, the results suggest that by exploiting the localization characteristics of the coils, an orchestrated play of shorter RF pulses can achieve desired excitation profiles faster without adding strains to gradients. A closed-form design for accelerated multidimensional excitations is described for the small-tip-angle regime, and its suppression of interfering aliasing lobes from coarse excitation k-space sampling is interpreted based on an analogy to sensitivity encoding (SENSE). With or without acceleration, the results also suggest that by taking advantage of the extra degrees of freedom inherent in a parallel system, parallel excitation provides better management of RF power deposition while facilitating the faithful production of desired excitation profiles. Sample accelerated and specific absorption rate (SAR)-reduced excitation pulses were designed in this study, and evaluated in experiments. PMID:15065251

  4. A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor

    NASA Astrophysics Data System (ADS)

    Sotiropoulou, C. L.; Gkaitatzis, S.; Annovi, A.; Beretta, M.; Kordas, K.; Nikolaidis, S.; Petridou, C.; Volpi, G.

    2014-10-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing, and the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation will ensure stable data processing by inserting the missing control words in the data stream. The 2D pixel clustering implementation is developed and tested in both single flow and parallel versions. The first parallel version with 16 parallel cluster identification engines is presented. The input data from the RODs are received through S-Links and the processing units that follow the clustering implementation also require a single data stream; therefore, data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced in order to accommodate the parallelized version and restore the data stream afterwards. The results of the first hardware tests of

  5. A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array

    NASA Astrophysics Data System (ADS)

    Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

    2013-05-01

    A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real-time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 μs time resolution of the gradient waveform, 1 μs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitalization operate at the same 60 MHz clock; therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field.
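
    Direct digital synthesis is commonly built around a phase accumulator, which the sketch below models in software. Only the 60 MHz clock comes from the abstract; the 32-bit accumulator width, the ideal (unquantized) sine output, and the names `tuning_word`, `actual_frequency`, and `dds_samples` are assumptions. The Nyquist limit of a 60 MHz clock is 30 MHz, which is consistent with the reported upper range of about 27 MHz lying somewhat below it.

```python
import math

F_CLK = 60e6   # clock frequency from the abstract (Hz)
ACC_BITS = 32  # accumulator width: an assumption for this sketch


def tuning_word(f_out):
    """Frequency tuning word that approximates f_out (Hz)."""
    if not 0 <= f_out < F_CLK / 2:
        raise ValueError("output must lie below the Nyquist limit F_CLK / 2")
    return round(f_out / F_CLK * 2**ACC_BITS)


def actual_frequency(ftw):
    """Frequency actually produced by a given tuning word."""
    return ftw * F_CLK / 2**ACC_BITS


def dds_samples(ftw, n):
    """First n output samples of an ideal (unquantized) sine DDS."""
    phase, out = 0, []
    for _ in range(n):
        out.append(math.sin(2 * math.pi * phase / 2**ACC_BITS))
        phase = (phase + ftw) % 2**ACC_BITS  # accumulator wraps like hardware
    return out
```

    With these assumed parameters the frequency step is F_CLK / 2^32, about 0.014 Hz, and a 15 MHz output corresponds to a tuning word of exactly 2^30, that is, a quarter turn of phase per clock.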

  6. A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array.

    PubMed

    Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

    2013-05-01

    A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real-time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 μs time resolution of the gradient waveform, 1 μs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitalization operate at the same 60 MHz clock; therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field. PMID:23742570

  7. Breast ultrasound tomography with two parallel transducer arrays

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Gao, Kai; Intrator, Miranda; Hanson, Kenneth

    2016-03-01

    Breast ultrasound tomography is an emerging imaging modality to reconstruct the sound speed, density, and ultrasound attenuation of the breast in addition to ultrasound reflection/beamforming images for breast cancer detection and characterization. We recently designed and manufactured a new synthetic-aperture breast ultrasound tomography prototype with two parallel transducer arrays consisting of a total of 768 transducer elements. The transducer arrays are translated vertically to scan the breast in a warm water tank from the chest wall/axillary region to the nipple region to acquire ultrasound transmission and reflection data for whole-breast ultrasound tomography imaging. The distance of these two ultrasound transducer arrays is adjustable for scanning breasts with different sizes. We use our breast ultrasound tomography prototype to acquire phantom and in vivo patient ultrasound data to study its feasibility for breast imaging. We apply our recently developed ultrasound imaging and tomography algorithms to ultrasound data acquired using our breast ultrasound tomography system. Our in vivo patient imaging results demonstrate that our breast ultrasound tomography can detect breast lesions shown on clinical ultrasound and mammographic images.

  8. Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986

    SciTech Connect

    Golub, G.H.

    1987-04-30

    The basic objective of this project was to consider a large class of matrix computations with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications such as the solution of partial differential equations by multigrid methods and in the fitting of geodetic data. Some of the methods developed have already found their use on some of the newly developed architectures.

  9. Nanocavity crossbar arrays for parallel electrochemical sensing on a chip

    PubMed Central

    Kätelhön, Enno; Mayer, Dirk; Banzet, Marko; Offenhäusser, Andreas

    2014-01-01

    We introduce a novel device for the mapping of redox-active compounds at high spatial resolution based on a crossbar electrode architecture. The sensor array is formed by two sets of 16 parallel band electrodes that are arranged perpendicular to each other on the wafer surface. At each intersection, the crossing bars are separated by a ca. 65 nm high nanocavity, which is stabilized by the surrounding passivation layer. During operation, perpendicular bar electrodes are biased to potentials above and below the redox potential of species under investigation, thus, enabling repeated subsequent reactions at the two electrodes. By this means, a redox cycling current is formed across the gap that can be measured externally. As the nanocavity devices feature a very high current amplification in redox cycling mode, individual sensing spots can be addressed in parallel, enabling high-throughput electrochemical imaging. This paper introduces the design of the device, discusses the fabrication process and demonstrates its capabilities in sequential and parallel data acquisition mode by using a hexacyanoferrate probe. PMID:25161846

  10. Experimental results for a photonic time reversal processor for the adaptive control of an ultra wideband phased array antenna

    NASA Astrophysics Data System (ADS)

    Zmuda, Henry; Fanto, Michael; McEwen, Thomas

    2008-04-01

    This paper describes a new concept for a photonic implementation of a time reversed RF antenna array beamforming system. The process does not require analog to digital conversion to implement and is therefore particularly suited for high bandwidth applications. Significantly, propagation distortion due to atmospheric effects, clutter, etc. is automatically accounted for with the time reversal process. The approach utilizes the reflection of an initial interrogation signal from off an extended target to precisely time match the radiating elements of the array so as to re-radiate signals precisely back to the target's location. The backscattered signal(s) from the desired location is captured by each antenna and used to modulate a pulsed laser. An electrooptic switch acts as a time gate to eliminate any unwanted signals such as those reflected from other targets whose range is different from that of the desired location resulting in a spatial null at that location. A chromatic dispersion processor is used to extract the exact array parameters of the received signal location. Hence, other than an approximate knowledge of the steering direction needed only to approximately establish the time gating, no knowledge of the target position is required, and hence no knowledge of the array element time delay is required. Target motion and/or array element jitter is automatically accounted for. Presented here are experimental results that demonstrate the ability of a photonic processor to perform the time-reversal operation on ultra-short electronic pulses.

  11. A longitudinal multi-bunch feedback system using parallel digital signal processors

    SciTech Connect

    Sapozhnikov, L.; Fox, J.D.; Olsen, J.J.; Oxoby, G.; Linscott, I.; Drago, A.; Serio, M.

    1993-12-01

    A programmable longitudinal feedback system based on four AT&T 1610 digital signal processors has been developed as a component of the PEP-II R&D program. This longitudinal quick prototype is a proof of concept for the PEP-II system and implements full-speed bunch-by-bunch signal processing for storage rings with bunch spacing of 4 ns. The design incorporates a phase-detector-based front end that digitizes the oscillation phases of bunches at the 250 MHz crossing rate, four programmable signal processors that compute correction signals, and a 250 MHz hold buffer/kicker driver stage that applies correction signals back on the beam. The design implements a general-purpose, table-driven downsampler that allows the system to be operated at several accelerator facilities. The hardware architecture of the signal processing is described, and the software algorithms used in the feedback signal computation are discussed. The system configuration used for tests at the LBL Advanced Light Source is presented.

  12. O(1) time algorithms for computing histogram and Hough transform on a cross-bridge reconfigurable array of processors

    SciTech Connect

    Kao, T.; Horng, S.; Wang, Y.

    1995-04-01

    Instead of using the base-2 number system, we use a base-m number system to represent the numbers used in the proposed algorithms. Such a strategy can be used to design an O(T) time, T = (log_m N) + 1, prefix sum algorithm for a binary sequence of N bits on a cross-bridge reconfigurable array of processors using N processors, where the data bus is m bits wide. This basic operation can then be used to compute the histogram of an n x n image with G gray-level values in constant time using G x n x n processors, and to compute the Hough transform of an image with N edge pixels and an n x n parameter space in constant time using n x n x N processors. This result is better than the previously known results proposed in the literature. Also, the execution time of the proposed algorithms is tunable by the bus bandwidth. 43 refs.
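
    The quantities in this abstract can be made concrete with a sequential reference sketch. It does not model the reconfigurable array itself. Reading the abstract's T = log_m N + 1 as the number of base-m digits needed to write a count as large as N (with the logarithm rounded down) is an assumption, as are the function names.

```python
from itertools import accumulate


def base_m_digits(n, m):
    """Number of base-m digits needed to write n >= 1 (floor(log_m n) + 1)."""
    digits = 0
    while n > 0:
        n //= m
        digits += 1
    return digits


def prefix_sums(bits):
    """Sequential reference for the prefix sums of a binary sequence."""
    return list(accumulate(bits))


def histogram(image, gray_levels):
    """Sequential reference: count of pixels at each gray level 0..G-1."""
    counts = [0] * gray_levels
    for row in image:
        for pixel in row:
            counts[pixel] += 1
    return counts
```

    Under this reading, a sequence of N = 1024 bits needs 11 digits in base 2 but only 3 digits in base 16, which illustrates how a wider bus can shorten the running time.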

  13. Implementation of monitors with macros: a programming aid for the HEP and other parallel processors

    SciTech Connect

    Lusk, E.L.; Overbeek, R.A.

    1983-12-01

    In a previous paper, the advantages of using monitors when implementing multiprocessing algorithms for the Denelcor HEP were delineated. A detailed presentation is given here of how monitors can be implemented on the HEP using a simple macro processor. The thesis is developed that a small body of general-purpose monitors can be defined to handle most standard synchronization patterns. We include the macro packages required to implement some of the more common synchronization patterns, including the fairly complex logic discussed before. Code produced using these macro packages is portable from one multiprocessing environment to another. Indeed, by recoding the set of basic macros (about 100 lines of code for the Denelcor HEP), most programs that are now being written could be moved to any similar multiprocessing system.

  14. Fast String Search on Multicore Processors: Mapping fundamental algorithms onto parallel hardware

    SciTech Connect

    Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

    2008-04-01

    String searching is one of these basic algorithms. It has a host of applications, including search engines, network intrusion detection, virus scanners, spam filters, and DNA analysis, among others. The Cell processor, with its multiple cores, promises to speed-up string searching a lot. In this article, we show how we mapped string searching efficiently on the Cell. We present two implementations: • The fast implementation supports a small dictionary size (approximately 100 patterns) and provides a throughput of 40 Gbps, which is 100 times faster than reference implementations on x86 architectures. • The heavy-duty implementation is slower (3.3-4.3 Gbps), but supports dictionaries with tens of thousands of strings.
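
    As an illustration of dictionary-based string searching, the following Python sketch uses the Aho-Corasick automaton, a standard technique for matching many patterns in one pass over the input. The abstract does not name the algorithm or data layout used on the Cell, so the algorithm choice and the names `build_automaton` and `find_all` are assumptions made here for illustration only.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and per-state match lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(pattern)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every dictionary hit in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

    The scan touches each input character once regardless of dictionary size, which is why automaton-based matching is a common starting point for high-throughput implementations.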

  15. Failure analysis in a highly parallel processor for L1 triggering

    SciTech Connect

    Cancelo, G.; Gottschalk, Erik Edward; Pavlicek, V.; Wang, M.; Wu, J.

    2003-12-01

    This paper studies how processor failures affect the dataflow of the Level 1 Trigger in the BTeV experiment proposed to run at Fermilab's Tevatron. The failure analysis is crucial for a system with over 2500 processing nodes and a number of storage units and communication links of the same order of magnitude. This paper is based on models of the L1 Trigger architecture and shows the dynamics of the architecture's dataflow. The dataflow analysis provides insight into how system variables are affected by single component failures and provides key information to the implementation of error recovery strategies. The analysis includes both short-term failures from which the system can recover quickly and long-term failures which imply a more drastic error-recovery strategy. The modeling results are supported by behavioral simulations of the L1 Trigger processing BTeV's GEANT Monte Carlo data.

  16. Low-power, real-time digital video stabilization using the HyperX parallel processor

    NASA Astrophysics Data System (ADS)

    Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

    2011-06-01

    Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air / ground vehicles that focuses on significantly reducing the size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full color video stream with a 53ms frame latency using a single 100 DSP core HyperX hx3100TM processor running at less than 3 W power draw. By comparison an Intel Core2 Duo processor running the same base algorithm on a 320x240, 15 fps stream consumes on the order of 18W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP based platform. In addition the implementation only requires a minimal number of components to interface directly to the imaging sensor and helmet mounted display or the same computing architecture can be used to generate software defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature based algorithm (11 x 11 Difference of Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera / platform motion on a frame-by-frame basis.

  17. Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2012-01-01

    The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

  18. Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays.

    PubMed

    Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin

    2016-01-01

    In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches, in this work, are developed for the incoherent and coherent signals, respectively. The proposed AMM methods estimate the azimuth angle only with the assumption that the elevation angles are known or estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches. PMID:26907301

  19. Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays

    PubMed Central

    Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin

    2016-01-01

    In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches, in this work, are developed for the incoherent and coherent signals, respectively. The proposed AMM methods estimate the azimuth angle only with the assumption that the elevation angles are known or estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches. PMID:26907301

  20. Media integration platform on superhigh-definition images: parallel digital signal processor approach

    NASA Astrophysics Data System (ADS)

    Ono, Sadayasu; Ohta, Naohisa; Fujii, Tetsuro

    1993-11-01

    This paper presents a new media integration platform based on super high definition (SHD) digital images and a high performance image processing system that adopts parallel digital processing. The new platform will encourage the integration of all existing media to realize rich and realistic visual communication over B-ISDN. SHD images have a resolution of more than 2048 X 2048 pixels and the frame rate is more than 60 frames/sec. To achieve the real-time compression of SHD moving images, parallel signal processing systems with a peak performance of 0.5 Tera Flops will be necessary. The specification requirements, focusing on the digital signal processing systems needed to achieve SHD image communication, are discussed.

  1. Multimode power processor

    DOEpatents

    O'Sullivan, George A.; O'Sullivan, Joseph A.

    1999-01-01

    In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

  2. Multimode power processor

    DOEpatents

    O'Sullivan, G.A.; O'Sullivan, J.A.

    1999-07-27

    In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

  3. Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

    SciTech Connect

    Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

    2005-09-20

    At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over µClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

  4. Block-Level Added Redundancy Explicit Authentication for Parallelized Encryption and Integrity Checking of Processor-Memory Transactions

    NASA Astrophysics Data System (ADS)

    Elbaz, Reouven; Torres, Lionel; Sassatelli, Gilles; Guillemin, Pierre; Bardouillet, Michel; Martinez, Albert

    The bus between the System on Chip (SoC) and the external memory is one of the weakest points of computer systems: an adversary can easily probe this bus in order to read private data (data confidentiality concern) or to inject data (data integrity concern). The conventional way to protect data against such attacks and to ensure data confidentiality and integrity is to implement two dedicated engines: one performing data encryption and another data authentication. This approach, while secure, prevents parallelizability of the underlying computations. In this paper, we introduce the concept of Block-Level Added Redundancy Explicit Authentication (BL-AREA) and we describe a Parallelized Encryption and Integrity Checking Engine (PE-ICE) based on this concept. BL-AREA and PE-ICE have been designed to provide an effective solution to ensure both security services while allowing for full parallelization on processor read and write operations and optimizing the hardware resources. Compared to standard encryption which ensures only confidentiality, we show that PE-ICE additionally guarantees code and data integrity for less than 4% of run-time performance overhead.

  5. Obtaining identical results with double precision global accuracy on different numbers of processors in parallel particle Monte Carlo simulations

    SciTech Connect

    Cleveland, Mathew A.; Brunner, Thomas A.; Gentile, Nicholas A.; Keasler, Jeffrey A.

    2013-10-15

    We describe and compare different approaches for achieving numerical reproducibility in photon Monte Carlo simulations. Reproducibility is desirable for code verification, testing, and debugging. Parallelism creates a unique problem for achieving reproducibility in Monte Carlo simulations because it changes the order in which values are summed. This is a numerical problem because double precision arithmetic is not associative. Parallel Monte Carlo, both domain replicated and decomposed simulations, will run their particles in a different order during different runs of the same simulation because of the non-reproducibility of communication between processors. In addition, runs of the same simulation using different domain decompositions will also result in particles being simulated in a different order. In [1], a way of eliminating non-associative accumulations using integer tallies was described. This approach successfully achieves reproducibility at the cost of lost accuracy by rounding double precision numbers to fewer significant digits. This integer approach, and other extended and reduced precision reproducibility techniques, are described and compared in this work. Increased precision alone is not enough to ensure reproducibility of photon Monte Carlo simulations. Non-arbitrary precision approaches require a varying degree of rounding to achieve reproducibility. For the problems investigated in this work double precision global accuracy was achievable by using 100 bits of precision or greater on all unordered sums which were subsequently rounded to double precision at the end of every time-step.
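
    The order dependence described above can be reproduced in a few lines of Python. The sketch below is not the code from the paper: it contrasts a plain double precision sum with a fixed-point integer tally in the spirit of the integer approach mentioned in the abstract. The function names and the choice of 40 fractional bits are assumptions made for illustration.

```python
import math


def naive_sum(values):
    """Left-to-right double precision accumulation (order dependent)."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_point_sum(values, frac_bits=40):
    """Round each value to a multiple of 2**-frac_bits and add as integers.

    Integer addition is associative, so the result does not depend on
    the order of the inputs; the price is the per-value rounding.
    """
    scale = 1 << frac_bits
    tally = 0
    for v in values:
        tally += round(v * scale)
    return tally / scale


if __name__ == "__main__":
    ordered = [1e16, 1.0, -1e16, 1.0]
    reordered = [1e16, -1e16, 1.0, 1.0]
    print(naive_sum(ordered), naive_sum(reordered))
    print(fixed_point_sum(ordered), fixed_point_sum(reordered))
    print(math.fsum(ordered), math.fsum(reordered))
```

    Python integers have unlimited width, so the tally cannot overflow here; a fixed-width implementation would have to budget bits for both range and rounding, which is the accuracy trade-off the abstract refers to. `math.fsum` is shown only as a standard-library reference for an exactly rounded sum.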

  6. Improving the performance of molecular dynamics simulations on parallel clusters.

    PubMed

    Borstnik, Urban; Hodoscek, Milan; Janezic, Dusanka

    2004-01-01

    In this article a procedure is derived to obtain a performance gain for molecular dynamics (MD) simulations on existing parallel clusters. Parallel clusters use a wide array of interconnection technologies to connect multiple processors together, often at different speeds, such as multiple processor computers and networking. It is demonstrated how to configure existing programs for MD simulations to efficiently handle collective communication on parallel clusters with processor interconnections of different speeds. PMID:15032512

  7. Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank

    NASA Astrophysics Data System (ADS)

    Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

    2014-05-01

    Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration.
Such increase in bandwidth is

  8. Method of up-front load balancing for local memory parallel processors

    NASA Technical Reports Server (NTRS)

    Baffes, Paul Thomas (Inventor)

    1990-01-01

    In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.

  9. Multimedia OC12 parallel interface using VCSEL array to achieve high-performance cost-effective optical interconnections

    NASA Astrophysics Data System (ADS)

    Chang, Edward S.

    1996-09-01

    Multimedia communication needs high-performance, cost-effective communication techniques to transport data for the fast-growing multimedia traffic resulting from the recent deployment of the World Wide Web (WWW), media-on-demand, and other multimedia applications. To transport a large volume of multimedia data, high-performance servers are required to perform media processing and transfer. Typically, the high-performance multimedia server is a massively parallel processor with a high number of I/O ports, high storage capacity, fast signal processing, and excellent cost-performance. The parallel I/O ports of the server are connected to multiple clients through a network switch which uses parallel links in both switch-to-server and switch-to-client connections. In addition to media processing and storage, media communication is also a major function of the multimedia system. Without a high-performance communication network, a high-performance server cannot deliver its full capacity of service to clients. Fortunately, there are many advanced communication technologies developed for networking, which can be adopted by the multimedia communication to economically deliver the full capacity of a high-performance multimedia service to clients. The VCSEL array technology has been developed for gigabit-rate parallel optical interconnections because of its high bandwidth, small-size, and easy-fabrication advantages. Several firms are developing multifiber, low-skew, low-cost ribbon cables to transfer signals from a VCSEL array. The OC12 SONET data-rate is widely used by high-performance multimedia communications for its high-data-rate and cost-effectiveness. Therefore, the OC12 VCSEL parallel optical interconnection is the ideal technology to meet the high-performance low-cost requirements for delivering affordable multimedia services to mass users.
This paper describes a multimedia OC12 parallel optical interconnection using a VCSEL array transceiver, a multifiber

  10. Massively parallel computation of lattice associative memory classifiers on multicore processors

    NASA Astrophysics Data System (ADS)

    Ritter, Gerhard X.; Schmalz, Mark S.; Hayden, Eric T.

    2011-09-01

    Over the past quarter century, concepts and theory derived from neural networks (NNs) have featured prominently in the literature of pattern recognition. Implementationally, classical NNs based on the linear inner product can present performance challenges due to the use of multiplication operations. In contrast, NNs having nonlinear kernels based on Lattice Associative Memories (LAM) theory tend to concentrate primarily on addition and maximum/minimum operations. More generally, the emergence of LAM-based NNs, with their superior information storage capacity, fast convergence and training due to relatively lower computational cost, as well as noise-tolerant classification has extended the capabilities of neural networks far beyond the limited applications potential of classical NNs. This paper explores theory and algorithmic approaches for the efficient computation of LAM-based neural networks, in particular lattice neural nets and dendritic lattice associative memories. Of particular interest are massively parallel architectures such as multicore CPUs and graphics processing units (GPUs). Originally developed for video gaming applications, GPUs hold the promise of high computational throughput without compromising numerical accuracy. Unfortunately, currently-available GPU architectures tend to have idiosyncratic memory hierarchies that can produce unacceptably high data movement latencies for relatively simple operations, unless careful design of theory and algorithms is employed. Advantageously, some GPUs (e.g., the Nvidia Fermi GPU) are optimized for efficient streaming computation (e.g., concurrent multiply and add operations). As a result, the linear or nonlinear inner product structures of NNs are inherently suited to multicore GPU computational capabilities. In this paper, the authors' recent research in lattice associative memories and their implementation on multicores is overviewed, with results that show utility for a wide variety of pattern

  11. Design and numerical evaluation of a volume coil array for parallel MR imaging at ultrahigh fields

    PubMed Central

    Pang, Yong; Wong, Ernest W.H.; Yu, Baiying

    2014-01-01

    In this work, we propose and investigate a volume coil array design method using different types of birdcage coils for MR imaging. Unlike the conventional radiofrequency (RF) coil arrays of which the array elements are surface coils, the proposed volume coil array consists of a set of independent volume coils including a conventional birdcage coil, a transverse birdcage coil, and a helix birdcage coil. The magnetic fluxes of these three birdcage coils are intrinsically cancelled, yielding a highly decoupled volume coil array. In contrast to conventional non-array type volume coils, the volume coil array would be beneficial in improving MR signal-to-noise ratio (SNR) and also gain the capability of implementing parallel imaging. The volume coil array is evaluated at the ultrahigh field of 7T using FDTD numerical simulations, and the g-factor map at different acceleration rates was also calculated to investigate its parallel imaging performance. PMID:24649435

  12. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

  13. Graphics-processor-unit-based parallelization of optimized baseline wander filtering algorithms for long-term electrocardiography.

    PubMed

    Niederhauser, Thomas; Wyss-Balmer, Thomas; Haeberlin, Andreas; Marisa, Thanks; Wildhaber, Reto A; Goette, Josef; Jacomet, Marcel; Vogel, Rolf

    2015-06-01

    Long-term electrocardiogram (ECG) often suffers from relevant noise. Baseline wander in particular is pronounced in ECG recordings using dry or esophageal electrodes, which are dedicated for prolonged registration. While analog high-pass filters introduce phase distortions, reliable offline filtering of the baseline wander implies a computational burden that has to be put in relation to the increase in signal-to-baseline ratio (SBR). Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite, and infinite impulse response, moving mean, and moving median filter. Individual filter parameters were optimized with respect to the SBR increase based on ECGs from the Physionet database superimposed to autoregressive modeled, real baseline wander. A Monte-Carlo simulation showed that for low input SBR the moving median filter outperforms any other method but negatively affects ECG wave detection. In contrast, the infinite impulse response filter is preferred in case of high input SBR. However, the parallelized wavelet filter is processed 500 and four times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low SBR situations. Using a signal segment of 64 mega samples that is filtered as entire unit, wavelet filtering of a seven-day high-resolution ECG is computed within less than 3 s. Taking the high filtering speed into account, the GPU wavelet filter is the most efficient method to remove baseline wander present in long-term ECGs, with which computational burden can be strongly reduced. PMID:25675449
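
    Of the filters listed above, the moving median is the simplest to state. The following Python sketch estimates the baseline as a sliding-window median and subtracts it from the signal. It is a sequential CPU illustration, not the authors' GPU implementation; the window length, the edge handling (shrinking the window at the boundaries) and the function names are assumptions.

```python
from statistics import median


def moving_median(signal, window):
    """Sliding-window median; the window shrinks at the signal edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    baseline = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        baseline.append(median(signal[lo:hi]))
    return baseline


def remove_baseline(signal, window):
    """Subtract the moving-median baseline estimate from the signal."""
    baseline = moving_median(signal, window)
    return [s - b for s, b in zip(signal, baseline)]


if __name__ == "__main__":
    # Constant offset of 5 with one narrow spike: the spike survives,
    # the offset is removed.
    print(remove_baseline([5, 5, 9, 5, 5], window=3))
```

    Each output sample depends only on a fixed neighbourhood of the input, so the per-sample work is independent and can in principle be distributed over many threads.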

  14. Computer Processor Allocator

    Energy Science and Technology Software Center (ESTSC)

    2004-03-01

    The Compute Processor Allocator (CPA) provides an efficient and reliable mechanism for managing and allotting processors in a massively parallel (MP) computer. It maintains information in a database on the health, configuration and allocation of each processor. This persistent information is factored into each allocation decision. The CPA runs in a distributed fashion to avoid a single point of failure.

  15. Performance evaluation of the JPL interim digital SAR processor

    NASA Technical Reports Server (NTRS)

    Wu, C.; Barkan, B.; Curlander, J.; Jin, M.; Pang, S.

    1983-01-01

    The performance of the Interim Digital SAR Processor (IDP) was evaluated. The IDP processor was originally developed for experimental processing of digital SEASAT SAR data. One phase of the system upgrade which features parallel processing in three peripheral array processors, automated estimation for Doppler parameters, and unsupervised image pixel location determination and registration was executed. The method to compensate for the target range curvature effect was improved. A four point interpolation scheme is implemented to replace the nearest neighbor scheme used in the original IDP. The processor still maintains its fast throughput speed. The current performance and capability of the processing modes now available on the IDP system are updated.
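
    The difference between nearest-neighbor and four-point interpolation can be sketched in Python. The abstract does not state which four-point kernel the IDP uses, so the cubic Lagrange form below, along with the function names, is an assumption chosen only to show why four neighboring samples give a smoother estimate than one.

```python
def nearest_neighbor(samples, x):
    """Value of the sample closest to fractional position x."""
    idx = min(max(int(round(x)), 0), len(samples) - 1)
    return samples[idx]


def four_point(samples, x):
    """Cubic Lagrange interpolation through the four samples around x."""
    n = len(samples)
    if n < 4:
        raise ValueError("need at least four samples")
    i = min(max(int(x), 1), n - 3)   # nodes i-1 .. i+2 stay in range
    nodes = range(i - 1, i + 3)
    total = 0.0
    for j in nodes:
        weight = 1.0
        for k in nodes:
            if k != j:
                weight *= (x - k) / (j - k)
        total += samples[j] * weight
    return total


if __name__ == "__main__":
    samples = [k * k for k in range(6)]   # samples of x**2
    print(nearest_neighbor(samples, 2.4), four_point(samples, 2.4))
```

    For samples of a smooth curve such as x**2, the four-point estimate follows the curve between grid points, while the nearest-neighbor estimate stays at the closest sample value.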

  16. Appendix E: Parallel Pascal development system

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The Parallel Pascal Development System enables Parallel Pascal programs to be developed and tested on a conventional computer. It consists of several system programs, including a Parallel Pascal to standard Pascal translator, and a library of Parallel Pascal subprograms. The library includes subprograms for using Parallel Pascal on a parallel system with a fixed degree of parallelism, such as the Massively Parallel Processor, to conveniently manipulate arrays which have dimensions larger than the hardware. Programs can be conveniently tested with small sized arrays on the conventional computer before attempting to run on a parallel system.

  17. Analysis of a parallel-arrayed power regulating system

    NASA Technical Reports Server (NTRS)

    Colburn, B. K.; Horton, H. M.; Honnell, M. A.

    1979-01-01

    A power regulation system incorporating n-parallel power supplies employing PWM switching regulators is studied. Analysis of individual unit operation and coupled-system parameter sensitivity is considered from an operations viewpoint. A detailed example is included to illustrate parallel system operation for 18 such units powered by solar-cell banks.

  18. Building and using a highly parallel programmable logic array

    SciTech Connect

    Gokhale, M.; Holmes, W.; Kopser, A.; Lucas, S.; Minnich, R.; Sweely, D.; Lopresti, D.

    1991-01-01

    With a $13,000 two-slot addition called Splash, a Sun workstation can outperform a Cray-2 on certain applications. Several applications, most involving bit-stream computations, have been run on Splash, which received a 1989 Gordon Bell Prize honorable mention for timings on a problem that compared a new DNA sequence against a library of sequences to find the closest match. In essence, Splash is a programmable linear logic array that can be configured to suit the problem at hand; it bridges the gap between the traditional fixed-function VLSI systolic array and the more versatile programmable array. As originally conceived, a systolic array is a collection of simple processing elements, along with a one- or two-dimensional nearest-neighbor communication pattern. The local nature of the communication gives the systolic array a high communications bandwidth, and the simple, fixed function gives a high packing density for VLSI implementation.

  19. VLSI processor with a configurable processing element array for balanced feature extraction in high-resolution images

    NASA Astrophysics Data System (ADS)

    Zhu, Hongbo; Shibata, Tadashi

    2014-01-01

    A VLSI processor employing a configurable processing element array (PEA) is developed for a newly proposed balanced feature extraction algorithm. In the algorithm, the input image is divided into square regions and the number of features is determined by noise effect analysis in each region. Regions of different sizes are used according to the resolutions and contents of input images. Therefore, inside the PEA, processing elements are hierarchically grouped for feature extraction in regions of different sizes. A proof-of-concept chip is fabricated using a 0.18 µm CMOS technology with a 32 × 32 PEA. From measurement results, a speed of 7.5 kfps is achieved for feature extraction in 128 × 128 pixel regions when operating the chip at 45 MHz, and a speed of 55 fps is also achieved for feature extraction in 1920 × 1080 pixel images.

  20. A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system

    NASA Technical Reports Server (NTRS)

    Olariu, S.; Schwing, J.; Zhang, J.

    1991-01-01

    A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n/log m) time on a 2-D PARBS of size mn × n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 × n.
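
    The O(1) corollary follows from the general bound by a single substitution. As a worked check (our own arithmetic, not taken from the paper), choose m = n^(1/2), which satisfies the constraint on m once n ≥ 9:

```latex
\[
T(n,m) = O\!\left(\frac{\log n}{\log m}\right), \qquad
m = n^{1/2} \;\Longrightarrow\;
T = O\!\left(\frac{\log n}{\tfrac{1}{2}\log n}\right) = O(2) = O(1),
\qquad
mn \times n = n^{1.5} \times n .
\]
```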

  1. High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures

    NASA Technical Reports Server (NTRS)

    Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

    1985-01-01

    Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

  2. Stream Processors

    NASA Astrophysics Data System (ADS)

    Erez, Mattan; Dally, William J.

    Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control, consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.

  3. Electrostatic quadrupole array for focusing parallel beams of charged particles

    DOEpatents

    Brodowski, John

    1982-11-23

    An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

  4. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day to day use for ten years and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  5. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
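
    The data flow described above (per-cell comparison, per-column combination, then selection of the closest column) can be sketched in software. This is only an illustrative model: the squared-difference closeness measure, the summing combiner, and the function name `nearest_codebook_index` are assumptions, since the abstract does not specify how the analog cells measure closeness.

```python
def nearest_codebook_index(input_vector, codebook):
    """Return the index of the codebook vector closest to input_vector.

    codebook is a list of N vectors, one per array column, each with M
    components (one per processor cell in that column).
    """
    best_index, best_score = -1, float("inf")
    for column, code_vector in enumerate(codebook):
        # Processor cells: compare matching components.
        # ASSUMPTION: closeness is modelled as a squared difference.
        cell_signals = [(x - c) ** 2 for x, c in zip(input_vector, code_vector)]
        # Combination device: merge the M cell signals of this column.
        # ASSUMPTION: the combiner is a plain sum.
        combined = sum(cell_signals)
        # Closeness determination: keep the column with the smallest value.
        if combined < best_score:
            best_index, best_score = column, combined
    return best_index
```

    In hardware all columns are evaluated simultaneously; the loop over columns here only serializes that comparison for clarity.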

  6. Parallel array of independent thermostats for column separations

    DOEpatents

    Foret, Frantisek; Karger, Barry L.

    2005-08-16

    A thermostat array including an array of two or more capillary columns (10) or two or more channels in a microfabricated device is disclosed. A heat conductive material (12) surrounds each individual column or channel in the array, each individual column or channel being thermally insulated from every other individual column or channel. One or more independently controlled heating or cooling elements (14) is positioned adjacent to individual columns or channels within the heat conductive material, each heating or cooling element being connected to a source of heating or cooling, and one or more independently controlled temperature sensing elements (16) is positioned adjacent to the individual columns or channels within the heat conductive material. Each temperature sensing element is connected to a temperature controller.

  7. A frequency and sensitivity tunable microresonator array for high-speed quantum processor readout

    NASA Astrophysics Data System (ADS)

    Whittaker, J. D.; Swenson, L. J.; Volkmann, M. H.; Spear, P.; Altomare, F.; Berkley, A. J.; Bumble, B.; Bunyk, P.; Day, P. K.; Eom, B. H.; Harris, R.; Hilton, J. P.; Hoskinson, E.; Johnson, M. W.; Kleinsasser, A.; Ladizinsky, E.; Lanting, T.; Oh, T.; Perminov, I.; Tolkacheva, E.; Yao, J.

    2016-01-01

    Superconducting microresonators have been successfully utilized as detection elements for a wide variety of applications. With multiplexing factors exceeding 1000 detectors per transmission line, they are the most scalable low-temperature detector technology demonstrated to date. For high-throughput applications, fewer detectors can be coupled to a single wire but utilize a larger per-detector bandwidth. For all existing designs, fluctuations in fabrication tolerances result in a non-uniform shift in resonance frequency and sensitivity, which ultimately limits the efficiency of bandwidth utilization. Here, we present the design, implementation, and initial characterization of a superconducting microresonator readout integrating two tunable inductances per detector. We demonstrate that these tuning elements provide independent control of both the detector frequency and sensitivity, allowing us to maximize the transmission line bandwidth utilization. Finally, we discuss the integration of these detectors in a multilayer fabrication stack for high-speed readout of the D-Wave quantum processor, highlighting the use of control and routing circuitry composed of single-flux-quantum loops to minimize the number of control wires at the lowest temperature stage.

  8. Achieving supercomputer performance for neural net simulation with an array of digital signal processors

    SciTech Connect

    Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W.

    1992-10-01

    Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

  9. Frequency and sensitivity tunable microresonator array for high-speed quantum processor readout

    NASA Astrophysics Data System (ADS)

    Hoskinson, Emile; Whittaker, J. D.; Swenson, L. J.; Volkmann, M. H.; Spear, P.; Altomare, F.; Berkley, A. J.; Bumble, B.; Bunyk, P.; Day, P. K.; Eom, B. H.; Harris, R.; Hilton, J. P.; Johnson, M. W.; Kleinsasser, A.; Ladizinsky, E.; Lanting, T.; Oh, T.; Perminov, I.; Tolkacheva, E.; Yao, J.

    Frequency multiplexed arrays of superconducting microresonators have been used as detectors in a variety of applications. The degree of multiplexing achievable is limited by fabrication variation causing non-uniform shifts in resonator frequencies. We have designed, implemented and characterized a superconducting microresonator readout that incorporates two tunable inductances per detector, allowing independent control of each detector frequency and sensitivity. The tunable inductances are adjusted using on-chip programmable digital-to-analog flux converters, which are programmed with a scalable addressing scheme that requires few external lines.

  10. Parallel RNA extraction using magnetic beads and a droplet array

    PubMed Central

    Shi, Xu; Chen, Chun-Hong; Gao, Weimin; Meldrum, Deirdre R.

    2015-01-01

    Nucleic acid extraction is a necessary step for most genomic/transcriptomic analyses, but it often requires complicated mechanisms to be integrated into a lab-on-a-chip device. Here, we present a simple, effective configuration for rapidly obtaining purified RNA from low concentration cell medium. This Total RNA Extraction Droplet Array (TREDA) utilizes an array of surface-adhering droplets to facilitate the transportation of magnetic purification beads seamlessly through individual buffer solutions without solid structures. The fabrication of TREDA chips is rapid and does not require a microfabrication facility or expertise. The process takes less than 5 minutes. When purifying mRNA from bulk marine diatom samples, its repeatability and extraction efficiency are comparable to conventional tube-based operations. We demonstrate that TREDA can extract the total mRNA of about 10 marine diatom cells, indicating that the sensitivity of TREDA approaches single-digit cell numbers. PMID:25519439

  11. Design and fabrication of diffractive microlens arrays with continuous relief for parallel laser direct writing.

    PubMed

    Tan, Jiubin; Shan, Mingguang; Zhao, Chenguang; Liu, Jian

    2008-04-01

    Diffractive microlens arrays with continuous relief are designed, fabricated, and characterized by using Fermat's principle to create an array of spots on the photoresist-coated surface of a substrate for parallel laser direct writing. Experimental results indicate that a diffraction efficiency of 71.4% and a spot size of 1.97 µm (FWHM) can be achieved at normal incidence and a writing laser wavelength of 441.6 nm with an array of F/4 fabricated on fused silica, and the developed array can be used to improve the utilization ratio of writing laser energy. PMID:18382568

  12. Development of a ground signal processor for digital synthetic array radar data

    NASA Technical Reports Server (NTRS)

    Griffin, C. R.; Estes, J. M.

    1981-01-01

    A modified APQ-102 sidelooking array radar (SLAR) in a B-57 aircraft test bed is used, with other optical and infrared sensors, in remote sensing of Earth surface features for various users at NASA Johnson Space Center. The video from the radar is normally recorded on photographic film and subsequently processed photographically into high resolution radar images. Using a high speed sampling (digitizing) system, the two receiver channels of cross- and co-polarized video are recorded on wideband magnetic tape along with radar and platform parameters. These data are subsequently reformatted and processed into digital synthetic aperture radar images with the image data available on magnetic tape for subsequent analysis by investigators. The system design and results obtained are described.

  13. Using a Cray Y-MP as an array processor for a RISC Workstation

    NASA Technical Reports Server (NTRS)

    Lamaster, Hugh; Rogallo, Sarah J.

    1992-01-01

    As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980's, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment which demonstrates that matrix multiplication can be executed remotely on a large system to speed the execution over that experienced on a workstation is described.
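
    The amortization argument can be made concrete with a back-of-the-envelope cost model: multiplying two n × n matrices takes about 2n³ floating-point operations but moves only 3n² values (two operands out, one result back), so computation eventually outweighs transfer as n grows. The sketch below is not from the report; the function names, the 8-byte element size, and every rate passed in are hypothetical illustration values.

```python
def local_time(n, local_flops):
    """Seconds to multiply two n x n matrices on the workstation."""
    return 2 * n**3 / local_flops


def remote_time(n, remote_flops, bytes_per_second, rpc_latency):
    """Seconds to multiply remotely: RPC overhead + data transfer + compute.

    ASSUMPTION: 8-byte elements, and three n x n matrices cross the network
    (two operands sent, one result returned).
    """
    transfer = 3 * n**2 * 8 / bytes_per_second
    compute = 2 * n**3 / remote_flops
    return rpc_latency + transfer + compute


def remote_is_faster(n, local_flops, remote_flops, bytes_per_second, rpc_latency):
    """True when shipping the multiplication to the large system pays off."""
    return remote_time(n, remote_flops, bytes_per_second, rpc_latency) < local_time(n, local_flops)
```

    With hypothetical rates of 1e8 flop/s locally, 1e9 flop/s remotely, 1e6 bytes/s on the network, and 10 ms RPC latency, the remote call loses at n = 1000 (transfer dominates) and wins at n = 4000 (the cubic compute term dominates).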

  14. Fully parallel write/read in resistive synaptic array for accelerating on-chip learning

    NASA Astrophysics Data System (ADS)

    Gao, Ligang; Wang, I.-Ting; Chen, Pai-Yu; Vrudhula, Sarma; Seo, Jae-sun; Cao, Yu; Hou, Tuo-Hung; Yu, Shimeng

    2015-11-01

    A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging and it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in the learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaOx/TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states could be continuously tuned by identical programming pulses. In order to demonstrate the advantages of parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into an array-level simulation, the proposed array architecture is able to achieve ∼95% recognition accuracy of MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
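
    The two array operations named above can be modelled numerically. In this minimal sketch the weighted sum is a vector-matrix product over the conductance matrix, and the parallel weight update is modelled as a rank-one (outer-product) change applied to every cell at once. The outer-product rule, the `rate` parameter, and both function names are assumptions for illustration; the abstract does not describe the pulse scheme actually used.

```python
def weighted_sum(conductance, row_voltages):
    """Column currents I_j = sum_i V_i * G_ij, read out in one step."""
    rows, cols = len(conductance), len(conductance[0])
    return [
        sum(row_voltages[i] * conductance[i][j] for i in range(rows))
        for j in range(cols)
    ]


def parallel_update(conductance, row_signal, col_signal, rate):
    """Return a new conductance matrix after one parallel write.

    ASSUMPTION: each cell changes by rate * row_signal[i] * col_signal[j],
    so one step touches the whole array regardless of its size.
    """
    return [
        [g + rate * r * c for g, c in zip(row, col_signal)]
        for row, r in zip(conductance, row_signal)
    ]
```

    A row-by-row scheme would need one write step per row, whereas the modelled update is a single step for any number of rows, which is the property the abstract highlights.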

  15. Sequence information signal processor

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1999-01-01

    An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
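
    A software analogue of this comparison is the classic Smith-Waterman local-alignment recurrence, in which each cell holds the best score of any segment ending at that pair of elements. The sketch below uses that recurrence as a stand-in: the patent abstract does not name its scoring scheme, so the match, mismatch, and gap values and the function name `best_local_alignment` are assumptions.

```python
def best_local_alignment(stored, streamed, match=2, mismatch=-1, gap=-1):
    """Return (score, i, j): the best local score and the 1-based end positions.

    Row i plays the role of the processor holding stored[i - 1]; the elements
    of `streamed` pass by it one after another.
    """
    previous = [0] * (len(streamed) + 1)  # scores handed on by the prior processor
    best = (0, 0, 0)
    for i, held in enumerate(stored, start=1):
        current = [0]
        for j, incoming in enumerate(streamed, start=1):
            diagonal = previous[j - 1] + (match if held == incoming else mismatch)
            # Best segment ending at (i, j); 0 means "start a new segment".
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            if score > best[0]:
                best = (score, i, j)
        previous = current
    return best
```

    In the hardware all processors work concurrently on different streamed elements; this version visits the same cells serially, keeping only one row of scores at a time.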

  16. Transmissive Nanohole Arrays for Massively-Parallel Optical Biosensing

    PubMed Central

    2015-01-01

    A high-throughput optical biosensing technique is proposed and demonstrated. This hybrid technique combines optical transmission of nanoholes with colorimetric silver staining. The size and spacing of the nanoholes are chosen so that individual nanoholes can be independently resolved in massive parallel using an ordinary transmission optical microscope, and, in place of determining a spectral shift, the brightness of each nanohole is recorded to greatly simplify the readout. Each nanohole then acts as an independent sensor, and the blocking of nanohole optical transmission by enzymatic silver staining defines the specific detection of a biological agent. Nearly 10000 nanoholes can be simultaneously monitored under the field of view of a typical microscope. As an initial proof of concept, biotinylated lysozyme (biotin-HEL) was used as a model analyte, giving a detection limit as low as 0.1 ng/mL. PMID:25530982

  17. Experimental verification of SNR and parallel imaging improvements using composite arrays.

    PubMed

    Maunder, Adam; Fallone, B Gino; Daneshmand, Mojgan; De Zanche, Nicola

    2015-02-01

    Composite MRI arrays consist of triplets where two orthogonal upright loops are placed over the same imaging area as a standard surface coil. The optimal height of the upright coils is approximately half the width for the 7 cm coils used in this work. Resistive and magnetic coupling is shown to be negligible within each coil triplet. Experimental evaluation of imaging performance was carried out on a Philips 3 T Achieva scanner using an eight-coil composite array consisting of three surface coils and five upright loops, as well as an array of eight surface coils for comparison. The composite array offers lower overall coupling than the traditional array. The sensitivities of upright coils are complementary to those of the surface coils and therefore provide SNR gains in regions where surface coil sensitivity is low, and additional spatial information for improved parallel imaging performance. Near the surface of the phantom the eight-channel surface coil array provides higher overall SNR than the composite array, but this advantage disappears beyond a depth of approximately one coil diameter, where it is typically more challenging to improve SNR. Furthermore, parallel imaging performance is better with the composite array compared with the surface coil array, especially at high accelerations and in locations deep in the phantom. Composite arrays offer an attractive means of improving imaging performance and channel density without reducing the size, and therefore the loading regime, of surface coil elements. Additional advantages of composite arrays include minimal SNR loss using root-sum-of-squares combination compared with optimal, and the ability to switch from high to low channel density by merely selecting only the surface elements, unlike surface coil arrays, which require additional hardware. PMID:25388793

  18. Breast ultrasound tomography with two parallel transducer arrays: preliminary clinical results

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Intrator, Miranda; Hanson, Kenneth; Epstein, Katherine; Sandoval, Daniel; Williamson, Michael

    2015-03-01

    Ultrasound tomography has great potential to provide quantitative estimations of physical properties of breast tumors for accurate characterization of breast cancer. We design and manufacture a new synthetic-aperture breast ultrasound tomography system with two parallel transducer arrays. The distance between these two transducer arrays is adjustable for scanning breasts of different sizes. The ultrasound transducer arrays are translated vertically to scan the entire breast slice by slice and acquire ultrasound transmission and reflection data for whole-breast ultrasound imaging and tomographic reconstructions. We use the system to acquire patient data at the University of New Mexico Hospital for clinical studies. We present some preliminary imaging results of in vivo patient ultrasound data. Our preliminary clinical imaging results show the promise of our breast ultrasound tomography system with two parallel transducer arrays for breast cancer imaging and characterization.

  19. Mitigation of cache memory using an embedded hard-core PPC440 processor in a Virtex-5 Field Programmable Gate Array.

    SciTech Connect

    Learn, Mark Walter

    2010-02-01

    Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.

  20. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity, emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

  1. Parallel Beam Approximation for Calculation of Detection Efficiency of Crystals in PET Detector Arrays

    PubMed Central

    Komarov, Sergey; Song, Tae Yong; Wu, Heyu; Tai, Yuan-Chuan

    2014-01-01

    In this work we propose a parallel beam approximation for the computation of the detection efficiency of crystals in a PET detector array. In this approximation the detection efficiency of a crystal is estimated using the distance between source and the crystal and the pre-calculated detection cross section of the crystal in a crystal array which is calculated for a uniform parallel beam of gammas. The pre-calculated detection cross sections for a few representative incident angles and gamma energies can be used to create a look-up table to be used in simulation studies or practical implementation of scatter or random correction algorithms. Utilizing the symmetries of the square crystal array, the pre-calculated look-up tables can be relatively small. The detection cross sections can be measured experimentally, calculated analytically or simulated using a Monte Carlo (MC) approach. In this work we used a MC simulation that takes into account the energy windowing, Compton scattering and factors in the “block effect”. The parallel beam approximation was validated by a separate MC simulation using point sources located at different positions around a crystal array. Experimentally measured detection efficiencies were compared with Monte Carlo simulated detection efficiencies. Results suggest that the parallel beam approximation provides an efficient and accurate way to compute the crystal detection efficiency, which can be used for estimation of random and scatter coincidences for PET data corrections. PMID:25400292

  2. A class of parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on the composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array, which achieves significantly higher efficiency, was also developed.

  3. A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm

    PubMed Central

    Guo, Xinyu; Wang, Hong; Devabhaktuni, Vijay

    2012-01-01

    A design of a systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for the Basic Local Alignment Search Tool (BLAST) algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs that detect at most one hit in one clock cycle, our design applies a Multiple Hits Detection Module, which is a pipelining systolic array, to search multiple hits in a single clock cycle. Further, we designed a Hits Combination Block which combines overlapping hits from the systolic array into one hit. These implementations completed the first and second steps of the BLAST architecture and achieved significant speedup compared with previously published architectures. PMID:25969747

  4. 10-channel fiber array fabrication technique for parallel optical coherence tomography system

    NASA Astrophysics Data System (ADS)

    Arauz, Lina J.; Luo, Yuan; Castillo, Jose E.; Kostuk, Raymond K.; Barton, Jennifer

    2007-02-01

    Optical Coherence Tomography (OCT) shows great promise for low intrusive biomedical imaging applications. A parallel OCT system is a novel technique that replaces mechanical transverse scanning with electronic scanning. This will reduce the time required to acquire image data. In this system an array of small diameter fibers is required to obtain an image in the transverse direction. Each fiber in the array is configured in an interferometer and is used to image one pixel in the transverse direction. In this paper we describe a technique to package 15 μm diameter fibers on a silicon-silica substrate to be used in a 2 mm endoscopic probe tip. Single mode fibers are etched to reduce the cladding diameter from 125 μm to 15 μm. Etched fibers are placed into a 4 mm by 150 μm trench in a silicon-silica substrate and secured with UV glue. Active alignment was used to simplify the layout of the fibers and minimize unwanted horizontal displacement of the fibers. A 10-channel fiber array was built, tested and later incorporated into a parallel optical coherence system. This paper describes the packaging, testing, and operation of the array in a parallel OCT system.

  5. Anisotropic charge and heat conduction through arrays of parallel elliptic cylinders in a continuous medium

    NASA Astrophysics Data System (ADS)

    Martin, James E.; Ribaudo, Troy

    2013-04-01

    Arrays of circular pores in silicon can exhibit a phononic bandgap when the lattice constant is smaller than the phonon scattering length, and so have become of interest for use as thermoelectric materials, due to the large reduction in thermal conductivity that this bandgap can cause. The reduction in electrical conductivity is expected to be less, because the lattice constant of these arrays is engineered to be much larger than the electron scattering length. As a result, electron transport through the effective medium is well described by the diffusion equation, and the Seebeck coefficient is expected to increase. In this paper, we develop an expression for the purely diffusive thermal (or electrical) conductivity of a composite comprised of square or hexagonal arrays of parallel circular or elliptic cylinders of one material in a continuum of a second material. The transport parallel to the cylinders is straightforward, so we consider the transport in the two principal directions normal to the cylinders, using a self-consistent local field calculation based on the point dipole approximation. There are two limiting cases: large negative contrast (e.g., pores in a conductor) and large positive contrast (conducting pillars in air). In the large negative contrast case, the transport is only slightly affected parallel to the major axis of the elliptic cylinders but can be significantly affected parallel to the minor axis, even in the limit of zero volume fraction of pores. The positive contrast case is just the opposite: the transport is only slightly affected parallel to the minor axis of the pillars but can be significantly affected parallel to the major axis, even in the limit of zero volume fraction of pillars. The analytical results are compared to extensive FEA calculations obtained using Comsol™ and the agreement is generally very good, provided the cylinders are sufficiently small compared to the lattice constant.
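
    For orientation, the simplest special case of such a composite (circular cylinders at low volume fraction) has a well-known closed form. The sketch below uses the classical dipole-level Maxwell Garnett estimate for transport normal to circular cylinders and the volume-weighted average for transport along them. It is not the elliptic-cylinder expression developed in the paper, and the function and parameter names are assumptions made for illustration.

```python
def parallel_conductivity(sigma_matrix, sigma_cyl, phi):
    """Conductivity along the cylinder axes: a volume-weighted average."""
    return (1.0 - phi) * sigma_matrix + phi * sigma_cyl


def transverse_conductivity_circular(sigma_matrix, sigma_cyl, phi):
    """Maxwell Garnett estimate normal to circular cylinders.

    phi is the area (volume) fraction of cylinders. The estimate keeps only
    dipole interactions, so it is most trustworthy at small phi.
    """
    # Contrast factor: -1 for insulating pores, +1 for perfectly conducting pillars.
    beta = (sigma_cyl - sigma_matrix) / (sigma_cyl + sigma_matrix)
    return sigma_matrix * (1.0 + beta * phi) / (1.0 - beta * phi)


if __name__ == "__main__":
    # Large negative contrast: insulating pores in a unit-conductivity matrix.
    print(parallel_conductivity(1.0, 0.0, 0.2))
    print(transverse_conductivity_circular(1.0, 0.0, 0.2))
```

    For insulating pores the transverse estimate reduces to (1 - phi)/(1 + phi) times the matrix conductivity, so the transverse direction is penalized more than the axial one at the same pore fraction.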

  6. Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit

    SciTech Connect

    Daily, Jeffrey A.; Lewis, Robert R.

    2011-11-30

    Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
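
    As an illustration of the kind of serial NumPy code this abstract refers to, the sketch below performs Jacobi relaxation using only whole-array slicing. It uses plain NumPy; the abstract does not give the GAiN module name or API, so no GAiN call is shown, and the function names here are hypothetical. Code written in this whole-array style, with no per-element Python loops, is the natural candidate for a distributed drop-in array replacement.

```python
import numpy as np


def jacobi_step(u):
    """One Jacobi relaxation sweep over the interior of a 2-D grid."""
    new = u.copy()
    # Each interior point becomes the mean of its four neighbours.
    new[1:-1, 1:-1] = 0.25 * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
    )
    return new


def solve(n=32, iters=200):
    """Relax a grid whose top edge is held at 1 and other edges at 0."""
    u = np.zeros((n, n))
    u[0, :] = 1.0
    for _ in range(iters):
        u = jacobi_step(u)
    return u


if __name__ == "__main__":
    u = solve()
    print(u.shape, float(u[1, 16]))
```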

  7. Implementation of monitors with macros: a programming aid for the HEP and other parallel processors. Rev. 1

    SciTech Connect

    Lusk, E.L.; Overbeek, R.A.

    1984-07-01

    In this report we give a detailed presentation of how monitors can be implemented on the HEP using a simple macro processor. We then develop the thesis that a small body of general-purpose monitors can be defined to handle most standard synchronization patterns. We include the macro packages required to implement some of the more common synchronization patterns, including the fairly complex logic discussed in a previous paper. Code produced using these macro packages is portable from one multiprocessing environment to another. Indeed, by recoding the set of basic macros (about 100 lines of code for the Denelcor HEP), most programs that we are now writing could be moved to any similar multiprocessing system.
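
    The monitor concept itself can be shown in a few lines. The sketch below is not the macro package described in the report: it swaps in Python's threading module to build a bounded-buffer monitor, and the class name BoundedBuffer is hypothetical. The defining features are the same, though: shared state is touched only while holding the monitor's lock, and callers wait on a condition until the state allows them to proceed.

```python
import threading
from collections import deque


class BoundedBuffer:
    """A monitor: all access to the queue goes through lock-protected methods."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._lock = threading.Lock()
        # Two condition variables that share the single monitor lock.
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)

    def put(self, item):
        with self._lock:
            while len(self._items) >= self._capacity:
                self._not_full.wait()   # releases the lock while waiting
            self._items.append(item)
            self._not_empty.notify()

    def get(self):
        with self._lock:
            while not self._items:
                self._not_empty.wait()
            item = self._items.popleft()
            self._not_full.notify()
            return item


if __name__ == "__main__":
    buf = BoundedBuffer(capacity=2)
    producer = threading.Thread(target=lambda: [buf.put(i) for i in range(5)])
    producer.start()
    print([buf.get() for _ in range(5)])
    producer.join()
```

    Keeping the synchronization logic inside one small class mirrors the report's portability argument: only the monitor's internals need to change when the code moves to a different multiprocessing environment.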

  8. Parallel processors and nonlinear structural dynamics algorithms and software. Semiannual progress report, 1 March-31 August 1988

    SciTech Connect

    Belytschko, T.

    1989-04-01

    A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.
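
    Explicit time integration, as mentioned in this abstract, advances the solution using only quantities already known at the current step, so no system of equations is solved. The sketch below shows the central difference scheme for a single degree of freedom. It is an illustration under simplifying assumptions (one unknown, a user-supplied internal force function, no damping or external load) and is not taken from WHAMS; all names are hypothetical.

```python
import math


def explicit_central_difference(mass, internal_force, u0, v0, dt, steps):
    """Integrate m * u'' = -internal_force(u) with the central difference scheme.

    Returns the list of displacements at each time step, starting with u0.
    """
    u = u0
    a = -internal_force(u) / mass
    v_half = v0 + 0.5 * dt * a          # velocity at the first half step
    history = [u]
    for _ in range(steps):
        u = u + dt * v_half             # displacement update uses known velocity
        a = -internal_force(u) / mass   # force evaluation; may be nonlinear in u
        v_half = v_half + dt * a        # advance velocity to the next half step
        history.append(u)
    return history


if __name__ == "__main__":
    # Linear spring with stiffness 1 and unit mass: the exact solution is cos(t).
    dt, steps = 0.01, 628
    u = explicit_central_difference(1.0, lambda x: 1.0 * x, 1.0, 0.0, dt, steps)
    print(u[-1], math.cos(steps * dt))
```

    The scheme is only conditionally stable: for a linear system the step must satisfy dt < 2/omega, where omega is the highest natural frequency, which is why explicit codes take many small steps.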

  9. Performance of the UCAN2 Gyrokinetic Particle In Cell (PIC) Code on Two Massively Parallel Mainframes with Intel "Sandy Bridge" Processors

    NASA Astrophysics Data System (ADS)

    Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul

    2013-10-01

    The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers, one at NERSC in the US built by Cray named Edison and the other at the Barcelona Supercomputer Center (BSC) in Spain built by IBM named MareNostrum III (MNIII), just happen to share the same Intel "Sandy Bridge" processors. The successful port of UCAN2 to MNIII, which came online first, has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.

  10. Pressure-driven perfusion culture microchamber array for a parallel drug cytotoxicity assay.

    PubMed

    Sugiura, Shinji; Edahiro, Jun-ichi; Kikuchi, Kyoko; Sumaru, Kimio; Kanamori, Toshiyuki

    2008-08-15

    This article reports a pressure-driven perfusion culture chip developed for parallel drug cytotoxicity assays. The device is composed of an 8 x 5 array of cell culture microchambers with independent perfusion microchannels. It is equipped with a simple interface for convenient access by a micropipette and connection to an external pressure source, which enables easy operation without special training. The unique microchamber structure was carefully designed with consideration of hydrodynamic parameters and was fabricated out of polydimethylsiloxane by using multilayer photolithography and replica molding. The microchamber structure enables uniform cell loading and perfusion culture without cross-contamination between neighboring microchambers. A parallel cytotoxicity assay was successfully carried out in the 8 x 5 microchamber array to analyze the cytotoxic effects of seven anticancer drugs. The pressure-driven perfusion culture chip, with its simple interface and well-designed microfluidic network, will likely become an advantageous platform for future high-throughput drug screening by microchip. PMID:18553395