Sample records for parallel processor array

  1. Design Space Exploration for Massively Parallel Processor Arrays

    Microsoft Academic Search

    Frank Hannig; Jürgen Teich

    2001-01-01

    In this paper, we describe an approach for the optimization of dedicated co-processors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each co-processor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computational

  2. Method of simulating additional processors in a simd parallel processor array

    Microsoft Academic Search

    W. D. Hillis; C. Lasser; B. Kahle; K. Sims

    1988-01-01

    This patent describes a single-instruction multiple-data (SIMD) parallel processor comprising a controller and an array of processors controlled in parallel by the controller, each processor comprising an identical input, an identical output, an identical processing element and an identical memory associated with each processing element, the processing element operating in accordance with instructions provided by the controller on data provided

  3. Titanic: a VLSI based content addressable parallel array processor

    SciTech Connect

    Weems, C.; Levitan, S.; Foster, C.

    1982-01-01

    A design is presented for a content addressable parallel array processor (CAPAP) which is both practical and feasible. Its practicality stems from an extensive program of research into real applications of content addressability and parallelism. The feasibility of the design stems from development under a set of conservative engineering constraints tied to limitations of VLSI technology. 1 ref.

  4. Digital Parallel Processor Array for Optimum Path Planning

    NASA Technical Reports Server (NTRS)

    Kremeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

    1996-01-01

    The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of a direction along which the first stimulus signal is received, broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.
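
The propagate-then-trace-back scheme in the abstract above can be illustrated in software. The following is a minimal sequential sketch, not the patented circuit: it assumes four neighbor links per cell, instantaneous links, and a hypothetical per-cell `delay` table standing in for terrain cost. The names `propagate` and `trace_back` are illustrative only, and each cell stores its predecessor cell rather than a direction code, which carries the same information.

```python
import heapq

# Assumed 4-connectivity; the abstract does not fix the number of link directions.
NEIGHBOR_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def propagate(delay, start):
    """Simulate stimulus propagation from `start`.

    delay[r][c] is the (hypothetical) time a cell waits before rebroadcasting;
    None marks an impassable cell. Returns a dict mapping each reached cell to
    the neighbor from which its first stimulus arrived (None for the start).
    """
    rows, cols = len(delay), len(delay[0])
    came_from = {start: None}
    # Each event is (time at which the cell broadcasts, cell).
    events = [(delay[start[0]][start[1]], start)]
    while events:
        t, (r, c) = heapq.heappop(events)
        for dr, dc in NEIGHBOR_OFFSETS:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if delay[nr][nc] is None or (nr, nc) in came_from:
                continue  # blocked, or a stimulus already arrived earlier
            came_from[(nr, nc)] = (r, c)  # remember the first arrival only
            heapq.heappush(events, (t + delay[nr][nc], (nr, nc)))
    return came_from


def trace_back(came_from, destination):
    """Follow stored predecessors from `destination` back to the start."""
    if destination not in came_from:
        return []  # the stimulus never reached this cell
    path = []
    cell = destination
    while cell is not None:
        path.append(cell)
        cell = came_from[cell]
    return path[::-1]


# A 3x3 terrain whose center cell is slow, so the path should go around it.
terrain = [
    [1, 1, 1],
    [1, 9, 1],
    [1, 1, 1],
]
path = trace_back(propagate(terrain, (0, 0)), (2, 2))
```

Because broadcast events are processed in time order, the first stimulus to reach a cell is the earliest one, which is what makes the traced path optimal under the assumed delay model.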

  5. Integration of IR focal plane arrays with massively parallel processor

    NASA Astrophysics Data System (ADS)

    Esfandiari, P.; Koskey, P.; Vaccaro, K.; Buchwald, W.; Clark, F.; Krejca, B.; Rekeczky, C.; Zarandy, A.

    2008-04-01

    The intent of this investigation is to replace the low fill factor visible sensor of a Cellular Neural Network (CNN) processor with an InGaAs Focal Plane Array (FPA) using both bump bonding and epitaxial layer transfer techniques for use in the Ballistic Missile Defense System (BMDS) interceptor seekers. The goal is to fabricate a massively parallel digital processor with a local as well as a global interconnect architecture. Currently, this unique CNN processor is capable of processing a target scene in excess of 10,000 frames per second with its visible sensor. What makes the CNN processor so unique is that each processing element includes memory, local data storage, local and global communication devices and a visible sensor supported by a programmable analog or digital computer program.

  6. Scalable parallel processor array for Jacobi-type matrix computations

    Microsoft Academic Search

    H. W. van Dijk; G. J. Hekstra; E. F. Deprettere

    1995-01-01

    This paper addresses the problem of designing a family of potential processor arrays for the execution of the so-called Jacobi algorithms. It extends the more familiar problem of designing a single fixed-size processor array for a particular program and is parametrised with respect to size in two ways. Firstly, the program is no longer a particular one but is

  7. VLSI Array processors

    Microsoft Academic Search

    S. Kung

    1985-01-01

    High speed signal processing depends critically on parallel processor technology. In most applications, general-purpose parallel computers cannot offer satisfactory real-time processing speed due to severe system overhead. Therefore, for real-time digital signal processing (DSP) systems, special-purpose array processors have become the only appealing alternative. In designing or using such array processors, most signal processing algorithms share the critical attributes of

  8. A 64 parallel integrated memory array processor and a 30 GIPS real-time vision system

    Microsoft Academic Search

    Yoshihiro Fujita; Nobuyuki Yamashita; S. Okazaki

    1995-01-01

    Describes a parallel-processor LSI chip (the Integrated Memory Array Processor, IMAP) and a compact real-time vision system (RVS-2). The IMAP integrates 64 8-bit processors, which operate in a SIMD manner, and 2-Mbit image memory on a single chip, and has peak performance of 3.84 GIPS. The RVS-2 consists of 8 IMAPs, a video interface, a control LSI chip (the Real-time

  9. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  10. Array processor architecture

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode, however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

  11. The massively parallel processor

    NASA Technical Reports Server (NTRS)

    Schaefer, D. H.; Fischer, J. R.; Wallgren, K. R.

    1980-01-01

    Future sensor systems will utilize massively parallel computing systems for rapid analysis of two-dimensional data. The Goddard Space Flight Center has an ongoing program to develop these systems. A single-instruction multiple data computer known as the Massively Parallel Processor (MPP) is being fabricated for NASA by the Goodyear Aerospace Corporation. This processor contains 16,384 processing elements arranged in a 128 x 128 array. The MPP will be capable of adding more than 6 billion 8-bit numbers per second. Multiplication of eight-bit numbers can occur at a rate of 2 billion per second. Delivery of the MPP to Goddard Space Flight Center is scheduled for 1983.

  12. QUEN - The APL wavefront array processor

    Microsoft Academic Search

    Dolecek

    1989-01-01

    Developments in computer networks are making parallel processing machines accessible to an increasing number of scientists and engineers. Several vector and array processors are already commercially available, as are costly systolic, wavefront, and massive parallel processors. This article discusses the Applied Physics Laboratory's entry: a low-cost, memory-linked wavefront array processor that can be used as a peripheral on existing computers.

  13. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  14. PARALLEL IMPLEMENTATION OF FINITE DIFFERENCE SCHEMES FOR THE PLATE EQUATION ON A FPGA-BASED MULTI-PROCESSOR ARRAY

    Microsoft Academic Search

    E. Motuk; R. Woods; S. Bilbao

    The computational complexity of the finite difference (FD) schemes for the solution of the plate equation prevents them from being used in musical applications. The explicit FD schemes can be parallelized to run on multi-processor arrays for achieving real-time performance. Field Programmable Gate Arrays (FPGAs) provide an ideal platform for implementing these architectures with the advantages of low-

  15. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  16. Array processors in chemistry

    SciTech Connect

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  17. The Associative Linear Array Processor

    Microsoft Academic Search

    Charles A. Finnila; Hubert H. Love Jr.

    1977-01-01

    The associative linear array processor (ALAP) is a new approach to making large associative processors practical. Data storage in shift registers, bit-serial arithmetic, LSI word cells, comprehensive arithmetic capability within the memory array, and electronic fault isolation are all utilized. The processor is a linear array of word cells, each containing memory and arithmetic logic. All connections to the word

  18. Peripheral array processors; Proceedings of the conference, Boston, MA, October 11, 12, 1984

    Microsoft Academic Search

    1984-01-01

    Various papers on peripheral array processors are presented. The topics discussed include: the changing role of peripheral array processors; recent developments in the CSPI MAP series of array processors; parallel processing approach selected by Floating Point Systems for providing a new generation of cost effective array processors and scientific computers; the Numerix MARS-432 array processor; and broader horizons for the

  19. Finding Speedup in Parallel Processors

    Microsoft Academic Search

    Michael J. Flynn; Robert G. Dimond; Oskar Mencer; Oliver Pell

    2008-01-01

    While recently the focus of architects and programmers has been on multi core, the alternative of processor node plus array oriented accelerator has some significant advantages especially in compute intensive static applications. We propose an acceleration methodology based on FPGA arrays (but, in principle it could be GPU or Cell based). The methodology uses a comprehensive application analysis supported by

  20. An inner product processor design using novel parallel counter circuits

    Microsoft Academic Search

    Rong Lin; A. S. Botha; K. E. Kerr; G. A. Brown

    1999-01-01

    This paper presents a novel parallel inner product processor architecture. The proposed processor has the following features: (1) it can be easily reconfigured for computing inner products of input arrays with four or more types of structures. Typically, each input array may contain 64 8-bit items, or 16 16-bit items, or 4 32-bit items, or 1 64-bit item, with items

  1. Parallel processor engine model program

    NASA Technical Reports Server (NTRS)

    Mclaughlin, P.

    1984-01-01

    The Parallel Processor Engine Model Program is a generalized engineering tool intended to aid in the design of parallel processing real-time simulations of turbofan engines. It is written in the FORTRAN programming language and executes as a subset of the SOAPP simulation system. Input/output and execution control are provided by SOAPP; however, the analysis, emulation and simulation functions are completely self-contained. A framework in which a wide variety of parallel processing architectures could be evaluated and tools with which the parallel implementation of a real-time simulation technique could be assessed are provided.

  2. Parallel Analog-to-Digital Image Processor

    NASA Technical Reports Server (NTRS)

    Lokerson, D. C.

    1987-01-01

    Proposed integrated-circuit network of many identical units converts analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output digital, lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

  3. Ring Array Processor (RAP): Software Architecture

    Microsoft Academic Search

    Jeff Bilmes; Phil Kohn

    1991-01-01

    The design and implementation of software for the Ring Array Processor (RAP), a high-performance parallel computer, involved development for three hardware platforms: Sun SPARC workstations, Heurikon MC68020 boards running the VxWorks real-time operating system, and Texas Instruments TMS320C30 DSPs. The RAP now runs in Sun workstations under UNIX and in a VME based system using VxWorks. A flexible set of tools has been provided

  4. The AIS-5000 parallel processor

    SciTech Connect

    Schmitt, L.A.; Wilson, S.S.

    1988-05-01

    The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has cost/performance characteristics superior to those of two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In this paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance benchmarks are given for certain simple and complex functions.

  5. An Integrated Memory Array Processor for Embedded Image Recognition Systems

    Microsoft Academic Search

    Shorin Kyo; Shin'ichiro Okazaki; Tamio Arai

    2007-01-01

    Embedded processors for video image recognition in most cases not only need to address the conventional cost (die size and power) versus real-time performance issue, but must also maintain high flexibility due to the immense diversity of recognition targets, situations, and applications. This paper describes IMAP, a highly parallel SIMD linear processor and memory array architecture that addresses these trade-off

  6. Bibliographic Pattern Matching Using the ICL Distributed Array Processor.

    ERIC Educational Resources Information Center

    Carroll, David M.; And Others

    1988-01-01

    Describes the use of a highly parallel array processor for pattern matching operations in a bibliographic retrieval system. The discussion covers the hardware and software features of the processor, the pattern matching algorithm used, and the results of experimental tests of the system. (37 references) (Author/CLB)

  7. Array Processor Has Power and Flexibility

    NASA Technical Reports Server (NTRS)

    Barnes, G. H.; Lundstrom, S. F.; Shafer, P. E.

    1982-01-01

    Proposed processor architecture would have flexibility of a multi-processor and computational power of a lockstep array. Using an efficient interconnection network, it accommodates a large number of individual processors and memory modules. Array architecture would be suitable for very large scientific simulation problems and other applications.

  8. Processing Remotely Sensed Data with Array Processors

    Microsoft Academic Search

    A. S. Margulies

    1976-01-01

    Array processors have been used extensively in military applications involving sonar and radar signal processing, but they have not been as widely employed in image processing applications. The constraints of limited word-size and limited programmability which previously made array processors unattractive have been mitigated with the architecture of the modern machines and the low cost of these modern processors relative

  9. Adaptively Parallel Processor Allocation for Cilk Jobs

    E-print Network

    Sen, Siddhartha

    The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known ...

  10. Ultrafast Fourier-transform parallel processor

    SciTech Connect

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  11. Task and instruction scheduling in parallel multithreaded processors

    E-print Network

    Mishra, Amitabh

    1996-01-01

    Parallel multithreading is a technique to execute parallel programs on a multithreaded superscalar processor. It enhances instruction throughput in a processor by combining program parallelism with the strong features of superscalar...

  12. Parallel processor for fast event analysis

    SciTech Connect

    Hensley, D.C.

    1983-01-01

    Current maximum data rates from the Spin Spectrometer of approx. 5000 events/s (up to 1.3 MBytes/s) and minimum analysis requiring at least 3000 operations/event require a CPU cycle time near 70 ns. In order to achieve an effective cycle time of 70 ns, a parallel processing device is proposed where up to 4 independent processors will be implemented in parallel. The individual processors are designed around the Am2910 Microsequencer, the Am29116 µP, and the Am29517 Multiplier. Satellite histogramming in a mass memory system will be managed by a commercial 16-bit µP system.
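
The 70 ns figure in the record above can be checked with a short calculation. This sketch assumes one operation per CPU cycle and a perfectly even split of work across processors; neither assumption is stated in the abstract, and the function names are illustrative only.

```python
def required_cycle_time_ns(events_per_s, ops_per_event):
    """Cycle time needed to keep up, assuming one operation per cycle."""
    ops_per_s = events_per_s * ops_per_event
    return 1e9 / ops_per_s


def per_processor_cycle_time_ns(effective_ns, n_processors):
    """Cycle time each processor may have if work divides evenly (assumed)."""
    return effective_ns * n_processors


# 5000 events/s at 3000 operations/event is 15 million operations/s,
# which needs a cycle time of about 66.7 ns, i.e. "near 70 ns".
needed = required_cycle_time_ns(5000, 3000)

# With 4 processors sharing the load evenly, each could run at 280 ns
# while still delivering an effective 70 ns cycle time.
each = per_processor_cycle_time_ns(70, 4)
```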

  13. Three Dimensional Graphics Algorithms on the Micro-Grain Array Processor-II

    E-print Network

    Bishop, Benjamin

    Three Dimensional Graphics Algorithms on the Micro-Grain Array Processor-II Benjamin Bishop Yan ... algorithms that have been mapped to the Micro-Grain Array Processor (MGAP), an inexpensive and versatile SIMD ... a general purpose parallel machine. 1: Introduction The increasing demand for three dimensional graphics ...

  14. An integrated memory array processor architecture for embedded image recognition systems

    Microsoft Academic Search

    Shorin Kyo; S. Okazaki; T. Arai

    2005-01-01

    Embedded processors for video image recognition require to address both the cost (die size and power) versus real-time performance issue, and also to achieve high flexibility due to the immense diversity of recognition targets, situations, and applications. This paper describes IMAP, a highly parallel SIMD linear processor and memory array architecture that addresses these trading-off requirements. By using parallel and

  15. An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems

    Microsoft Academic Search

    Shorin Kyo; Shin'ichiro Okazaki; Tamio Arai

    2005-01-01

    Embedded processors for video image recognition require to address both the cost (die size and power) versus real-time performance issue, and also to achieve high flexibility due to the immense diversity of recognition targets, situations, and applications. This paper describes IMAP, a highly parallel SIMD linear processor and memory array architecture that addresses these trading-off requirements. By using parallel and

  16. Parallel processor programs in the Federal Government

    SciTech Connect

    Schneck, P.B.; Austin, D.; Squires, S.L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-06-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  17. Parallel processor programs in the Federal Government

    NASA Technical Reports Server (NTRS)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  18. Mapping between parallel processor structures and programs

    NASA Technical Reports Server (NTRS)

    Ngai, Tin-Fook; Yan, Jerry C.; Mak, Victor W. K.; Flynn, Michael J.; Lundstrom, Stephen F.

    1987-01-01

    This paper reports some ongoing research efforts at Stanford in allocation of parallel processing resources. Both processor structures and program structures have their own characteristics. Resource allocation binds the two structures during program execution. The mapping problem determines what processor structure and program structure may be combined to obtain maximum speedup. Three approaches to this mapping problem are considered. Two important factors, granularity and interaction delay, are also considered. A new hierarchical approach to structure definition is outlined. Effective and efficient tools are necessary for the study of the mapping problem. A fast turn-around simulation environment developed for investigating partition strategies for distributed computations and a computationally efficient method to predict performance of parallel processor structures are described.

  19. Fault-tolerant array processors using single-track switches

    SciTech Connect

    Kung, S.Y.; Jean, S.N.; Chang, C.W.

    1989-04-01

    An array processor is a collection of many similar processing elements (PE's), which can be executed in both parallel and pipeline processing. For the implementation of arrays of a large number of processors, fault tolerance has always been a very critical design issue. Very often, spare PE's and switching lattices are incorporated in the array to improve the (fabrication-time) yield and the (run-time) reliability. In this paper, an array grid model based on single-track switches is proposed. A reconfigurability theorem is developed to provide the theoretical footing for new reconfiguration algorithms for the fabrication-time and run-time processing. For fabrication-time yield enhancement, the problem of finding a feasible reconfiguration using global control can be reformulated as a maximum independent set problem. An existing algorithm in graph theory is adopted to solve this problem.

  20. Performance limitations in parallel processor simulations

    NASA Technical Reports Server (NTRS)

    O'Grady, E. Pearse; Wang, Chung-Hsien

    1987-01-01

    A jet-engine model is partitioned and simulated on a parallel processor system consisting of five 8086/8087 floating-point computers. The simulation uses Heun's integration method. A near-optimal parallel simulation (in the sense of minimum execution time) achieves speedup of only 2.13 and efficiency of 42.6 percent, in effect wasting 57.4 percent of the available processing power. A detailed analysis identifies and graphically demonstrates why the system fails to achieve ideal performance (viz., speedup of 5 and efficiency of 100 percent). Inherent characteristics of the problem equations and solution algorithm account for the loss of nearly half of the available processing power. Overheads associated with interprocessor communication and processor synchronization account for only a small fraction of the lost processing power. The effects of these and other factors which limit parallel processor performance are illustrated through real-time timing-analyzer traces describing the run/idle status of the parallel processors during the simulation.
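
The speedup and efficiency figures quoted in the record above follow from the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. A minimal sketch, with function names chosen here for illustration:

```python
def speedup(serial_time, parallel_time):
    """Ratio of single-processor time to parallel execution time."""
    return serial_time / parallel_time


def efficiency(achieved_speedup, n_processors):
    """Fraction of the ideal (linear) speedup actually achieved."""
    return achieved_speedup / n_processors


# Figures from the abstract: speedup of 2.13 on five processors.
eff = efficiency(2.13, 5)   # about 0.426, i.e. 42.6 percent
wasted = 1.0 - eff          # about 0.574, i.e. 57.4 percent
```

Ideal performance on five processors would be a speedup of 5, for which `efficiency(5, 5)` is exactly 1.0.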

  1. Two-dimensional mesh-connected parallel processor with complex processing elements

    Microsoft Academic Search

    Chaoyang Chen; Xubang Shen; Zhong Wang; Hongshi Sang

    2001-01-01

    LS MPP is a massively parallel processor. It has fine-grained parallelism with up to 4096 processing elements arranged in a SIMD architecture. The processing elements are arranged in a 64x64 two-dimensional mesh-connected array for low-level image processing. In this paper, the system architecture, the components of the processing element, the array controller, and the memory organization of the LS MPP processor are described. In the final

  2. Computations on the massively parallel processor at the Goddard Space Flight Center

    Microsoft Academic Search

    James P. Strong

    1991-01-01

    Described are four significant algorithms implemented on the massively parallel processor (MPP) at the Goddard Space Flight Center. Two are in the area of image analysis. Of the other two, one is a mathematical simulation experiment and the other deals with the efficient transfer of data between distantly separated processors in the MPP array. The first algorithm presented is the

  3. Efficient searching and sorting applications using an associative array processor

    NASA Technical Reports Server (NTRS)

    Pace, W.; Quinn, M. J.

    1978-01-01

    The purpose of this paper is to describe a method of searching and sorting data by using some of the unique capabilities of an associative array processor. To understand the application, the associative array processor is described in detail. In particular, the content addressable memory and flip network are discussed because these two unique elements give the associative array processor the power to rapidly sort and search. A simple alphanumeric sorting example is explained in hardware and software terms. The hardware used to explain the application is the STARAN (Goodyear Aerospace Corporation) associative array processor. The software used is the APPLE (Array Processor Programming Language) programming language. Some applications of the array processor are discussed. This summary tries to differentiate between the techniques of the sequential machine and the associative array processor.

  4. Impact of VLSI on peripheral array processors

    Microsoft Academic Search

    1982-01-01

    Many modern applications, such as signal processing and real-time simulation, require large amounts of computation. This poses a dilemma: general-purpose computers are too costly for real-time applications, and special-purpose hardware is not flexible enough for research and development work. Peripheral array processors (PAP) can provide a cost-effective compromise that combines the best of both worlds: large amounts of computation performed

  5. Fast Hough Transform On A Mesh Connected Processor Array

    NASA Astrophysics Data System (ADS)

    Kannar, C. S.; Chuang, Henry Y. H.

    1988-02-01

    The Hough transform is an effective method for detecting the shape of object boundaries in image pattern analysis. Since the Hough transform is very computation intensive, it is essential to parallelize the computation. However, an effective parallel algorithm is harder to obtain because it requires global information. In this paper we present an efficient parallel Hough transform algorithm for the detection of straight lines using mesh connected processor arrays. While other parallel algorithms take either O(n2) or O(n2) time, where n is the number of distinct values of a parameter and N is the number of edge pixels, our algorithm takes O(n) time.
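
    For orientation, the straight-line Hough transform that the abstract refers to can be sketched sequentially as below. This is a plain single-processor reference, not the mesh algorithm of the paper; the function names, the integer rho binning, and the `rho_max` parameter are assumptions made for illustration. With N edge pixels and n angle values it casts N*n votes, which is the work a parallel version has to distribute.

```python
import math

def hough_lines(edge_pixels, n_theta, rho_max):
    """Vote for lines rho = x*cos(theta) + y*sin(theta).

    edge_pixels: iterable of (x, y) integer coordinates.
    n_theta: number of distinct angle values in [0, pi).
    rho_max: assumed bound on |rho|; bins cover -rho_max..rho_max.
    """
    pixels = list(edge_pixels)
    acc = [[0] * (2 * rho_max + 1) for _ in range(n_theta)]
    for t in range(n_theta):
        theta = math.pi * t / n_theta
        c, s = math.cos(theta), math.sin(theta)
        for x, y in pixels:
            rho = round(x * c + y * s)
            acc[t][rho + rho_max] += 1  # one vote per (pixel, angle) pair
    return acc

def strongest_line(acc, rho_max):
    """Return (votes, theta_index, rho) for the fullest accumulator bin."""
    return max(
        (votes, t, r - rho_max)
        for t, row in enumerate(acc)
        for r, votes in enumerate(row)
    )
```

    For example, ten pixels on the horizontal line y = 3 all vote for the same bin at theta = pi/2, so that bin dominates the accumulator.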

  6. Marching-pixels: a new organic computing paradigm for smart sensor processor arrays

    Microsoft Academic Search

    Dietmar Fey; Daniel Schmidt

    2005-01-01

    In this paper we present a new organic computing principle denoted as marching pixels for the architectures of future smart CMOS camera chips. The idea of marching pixels is based on the realization of a massively-parallel fine-grain single-chip processor array. Marching pixels are virtual organic units which are propagating in a pixel processor array, similar to virtual ants in ant

  7. A Sliding Memory Plane Array Processor

    Microsoft Academic Search

    Myung Hoon Sunwoo; J. K. Aggarwal

    1993-01-01

    A mesh-connected single-instruction multiple-data (SIMD) architecture called a sliding memory plane (SliM) array processor is proposed. Differing from existing mesh-connected SIMD architectures, SliM has several salient features such as a sliding memory plane that provides inter-PE communication during computation. Two I/O planes provide an I/O overlapping capability. Thus, inter-PE communication and I/O overhead can be overlapped with computation. Inter-PE communication

  8. Contextual classification on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Tilton, James C.

    1987-01-01

    Classifiers are often used to produce land cover maps from multispectral Earth observation imagery. Conventionally, these classifiers have been designed to exploit the spectral information contained in the imagery. Very few classifiers exploit the spatial information content of the imagery, and the few that do rarely exploit spatial information content in conjunction with spectral and/or temporal information. A contextual classifier that exploits spatial and spectral information in combination through a general statistical approach was studied. Early test results obtained from an implementation of the classifier on a VAX-11/780 minicomputer were encouraging, but they are of limited meaning because they were produced from small data sets. An implementation of the contextual classifier is presented on the Massively Parallel Processor (MPP) at Goddard that for the first time makes feasible the testing of the classifier on large data sets.

  9. Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor

    E-print Network

    Scott, Michael L.

    Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor Thomas J. Le of Rochester have used a collection of BBN Butterfly TM Parallel Processors to conduct research in parallel with the Butterfly we have ported three compilers, developed five major and several minor library packages, built two

  10. A new control structure for the pipelined CNN processor arrays

    Microsoft Academic Search

    Nerhun Yildiz; Evren Cesur; Vedat Tavsanoglu

    2010-01-01

    In this paper an improvement over the control structure of the processor architecture reported in is proposed. Each processor in the array was controlled by the central control unit which proved to have some setbacks. These are: 1) the complexity of the control logic which tends to be more complicated as the number of processors gets higher; 2) the necessity

  11. Scan line graphics generation on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1988-01-01

    Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

  12. Massively parallel MRI detector arrays.

    PubMed

    Keil, Boris; Wald, Lawrence L

    2013-04-01

    Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas via reception, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called "ultimate" SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

  13. AsAP: An Asynchronous Array of Simple Processors

    Microsoft Academic Search

    Zhiyi Yu; Michael J. Meeuwsen; Ryan W. Apperson; Omar Sattari; Michael Lai; Jeremy W. Webb; Eric W. Work; Dean Truong; Tinoosh Mohsenin; Bevan M. Baas

    2008-01-01

    An array of simple programmable processors is implemented in 0.18 μm CMOS and contains 36 asynchronously clocked independent processors. Each processor occupies 0.66 mm² and is fully functional at a clock rate of 520-540 MHz at 1.8 V and over 600 MHz at 2.0 V. Processors dissipate an average of 32 mW under typical conditions at 1.8 V and 475 MHz,

  14. Digital image processing software system using an array processor

    SciTech Connect

    Sherwood, R.J.; Portnoff, M.R.; Journeay, C.H.; Twogood, R.E.

    1981-03-10

    A versatile array processor-based system for general-purpose image processing was developed. At the heart of this system is an extensive, flexible software package that incorporates the array processor for effective interactive image processing. The software system is described in detail, and its application to a diverse set of applications at LLNL is briefly discussed. 4 figures, 1 table.

  15. A transformative approach to the partitioning of processor arrays

    Microsoft Academic Search

    Jürgen Teich; Lothar Thiele

    1992-01-01

    The paper describes the systematic design of processor arrays with a given dimension and a given number of processing elements. The unified approach to the solution of this problem called partitioning is based on the following concepts: (1) Algorithms and processor arrays are represented by (piecewise regular) programs. (2) The concept of stepwise refinement of programs is used to solve

  16. Towards the automated design of application specific array processors (ASAPs)

    Microsoft Academic Search

    A. P. Marriott; A. W. G. Duller; R. H. Storer; A. R. Thomson; M. R. Pout

    1990-01-01

    The authors describe the architecture and VLSI design of GLiTCH, an associative processor array chip designed for computer vision applications. The design is built from a library of cells, which can be used in conjunction with high level functional specifications to rapidly design new application specific array processors. The objective is to design a system which will allow application specific

  17. Bispectrum signal processing on HNC's SIMD numerical array processor (SNAP)

    SciTech Connect

    Means, R.W.; Wallach, B.; Busby, D. [HNC, Inc., San Diego, CA (United States)]; Lengel, R.C. Jr. [Tracor Applied Sciences, Inc., Austin, TX (United States)]

    1993-12-31

    Supercomputers and parallel processors are increasingly being applied to problems traditionally described as signal and image processing problems. The primary activities occurring in either processing area are detection, enhancement, and classification of signals embedded in additive noise. The bispectrum is a processing technique that can be used for improving the detection of signals in noise. It is an order N^2 operation performed over a two-dimensional frequency plane and, because of computational demands, has not been used much in practice. HNC has developed a commercially available SIMD Numerical Array Processor (SNAP) and implemented Tracor's computationally demanding bispectrum signal processing code as a submission for the Gordon Bell prize. The SNAP is a SIMD array of parallel processors connected in a linear ring. A SNAP system with 32 processors (SNAP-32) demonstrated a performance of over 7.5 gigaflops per million dollars.
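
    The order-N^2 cost mentioned above comes from evaluating one product per frequency pair. A minimal sketch, assuming the commonly used single-record definition B(f1, f2) = X(f1) X(f2) conj(X(f1 + f2)) with frequency indices taken modulo N, is given below. It is not HNC's or Tracor's code; the function names are hypothetical and a naive DFT is used only to keep the example self-contained.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a sequence (O(N^2), for clarity)."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]

def bispectrum(x):
    """Single-record bispectrum estimate over the full N x N frequency plane.

    B[f1][f2] = X[f1] * X[f2] * conj(X[(f1 + f2) mod N]).
    The double loop over (f1, f2) is the order-N^2 part.
    """
    n = len(x)
    X = dft(x)
    return [
        [X[f1] * X[f2] * X[(f1 + f2) % n].conjugate() for f2 in range(n)]
        for f1 in range(n)
    ]
```

    In practice the estimate is usually averaged over many records to reduce variance; that step is omitted here.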

  18. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  19. High-performance FFT implementation on the BOPS ManArray parallel DSP

    Microsoft Academic Search

    Nikos P. Pitsianis; Gerald Pechanek

    We present a high performance implementation of the FFT algorithm on the BOPS ManArray parallel DSP processor. The ManArray we consider for this application consists of an array controller and 2 to 4 fully interconnected processing elements. To expose the parallelism inherent to an FFT algorithm we use a factorization of the DFT matrix in Kronecker products, permutation and diagonal

  20. An accurate projection algorithm for array processor based SPECT systems

    SciTech Connect

    King, M.A.; Schwinger, R.B.; Cool, S.L.

    1985-05-01

    A data re-projection algorithm has been developed for use in single photon emission computed tomography (SPECT) on an array processor based computer system. The algorithm makes use of an accurate representation of pixel activity (uniform square pixel model of intensity distribution), and is rapidly performed due to the efficient handling of an array based algorithm and the Fast Fourier Transform (FFT) on parallel processing hardware. The algorithm consists of using a pixel driven nearest neighbour projection operation to an array of subdivided projection bins. This result is then convolved with the projected uniform square pixel distribution before being compressed to original bin size. This distribution varies with projection angle and is explicitly calculated. The FFT combined with a frequency space multiplication is used instead of a spatial convolution for more rapid execution. The new algorithm was tested against other commonly used projection algorithms by comparing the accuracy of projections of a simulated transverse section of the abdomen against analytically determined projections of that transverse section. The new algorithm was found to yield comparable or better standard error and yet result in easier and more efficient implementation on parallel hardware. Applications of the algorithm include iterative reconstruction and attenuation correction schemes and evaluation of regions of interest in dynamic and gated SPECT.

  1. R256: a research parallel processor for scientific computation

    Microsoft Academic Search

    T. Fukazawa; T. Kimura; M. Tomizawa; K. Takeda; Y. Itoh

    1989-01-01

    A scientific parallel processor called the R256 has been developed. The R256 is composed of 16x16 processing elements, and has the outstanding features of a “distributed parallel network” as well as an IEEE 80-bit extended floating point computation ability. The computation accuracy, required by an exhaustive number of iterations in scientific computations, is resolved by the dedicated 80-bit VLSI processor,

  2. Experience with a multiprocessor based on eight FPS 120B array processors

    SciTech Connect

    Bucher, I.Y.; Frederickson, P.O.; Moore, J.W.

    1981-01-01

    The rate of increase in the speed of monoprocessors is no longer keeping pace with the needs of the laboratory; accordingly, the use of parallel processors in large scientific computations is being investigated. As an initial experiment, a particle-in-cell plasma simulation was adapted to run on a star graph architecture consisting of a UNIVAC 1110 as hub, and up to eight Floating Point Systems AP120B array processors at the other vertices. Subdivision of tasks among processors and measured results are discussed.

  3. An embedded real-time SIMD processor array for image processing

    Microsoft Academic Search

    David Andrews; Cliff Kancler; Barry Wealand

    1996-01-01

    The paper presents an overview of the SuperSPAR (Systolic Processor Array) architecture and chip set. The SuperSPAR was designed by Lockheed Martin to bring the benefits of massively parallel SIMD processing to the embedded systems domain. The system philosophy focused on building a hierarchy of scaleable subarray modules allowing systems to be configured by “plugging together” any number of these

  4. Design and optimization of a defect tolerant processor array

    E-print Network

    Lakkapragada, Bhavani S

    1995-01-01

    In this thesis, the design and optimization of a defect-tolerant MIMD processor array, for maximum performance per wafer area, targeted at applications that have a large number of operations per memory word, is described. The optimization includes...

  5. Co-Design of Massively Parallel Embedded Processor Architectures

    Microsoft Academic Search

    Frank Hannig; Hritam Dutta; Alexey Kupriyanov; Jürgen Teich; Rainer Schaffer; Sebastian Siegel; Renate Merker; Ronan Keryell; Bernard Pottier; Daniel Chillet; Daniel Menard; Olivier Sentieys

    2005-01-01

    In this paper, we introduce a methodology for the systematic mapping, evaluation, and exploration of massively parallel processor architectures that are designed for special purpose applications in the world of embedded computers. The investigated class of computer architectures can be described by massively parallel networked processing elements that, using today's hardware technology, may be implemented on a

  6. Parallel modular multiplication on multi-core processors

    E-print Network

    Boyer, Edmond

    Parallel modular multiplication on multi-core processors Pascal Giorgi LIRMM, CNRS, UM2 Montpellier parallel modular multiplications. Famous methods such as Barrett, Montgomery as well as more recent algorithms are compared together with a novel k-ary multipartite multiplication which allows to split

  7. Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors*

    E-print Network

    Kaltofen, Erich

    Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors* Erich Kaltofen@cs.rpi.edu Preliminary Report (July 1, 1989) 1. Introduction We present a parallel algebraic PRAM algorithm that can scheme on an algebraic circuit of size = O(n^{ω+1} log(n)) and depth = O(log(n)^2) This more general

  8. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    DOEpatents

    Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

  9. Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic Path Problem

    NASA Astrophysics Data System (ADS)

    Sedukhin, Stanislav G.; Miyazaki, Toshiaki; Kuroda, Kenichi

    The algebraic path problem (APP) is a general framework which unifies several solution procedures for a number of well-known matrix and graph problems. In this paper, we present a new 3-dimensional (3-D) orbital algebraic path algorithm and corresponding 2-D toroidal array processors which solve the n × n APP in the theoretically minimal number of 3n time-steps. The coordinated time-space scheduling of the computing and data movement in this 3-D algorithm is based on the modular function which preserves the main technological advantages of systolic processing: simplicity, regularity, locality of communications, pipelining, etc. Our design of the 2-D systolic array processors is based on a classical 3-D→2-D space transformation. We have also shown how a data manipulation (copying and alignment) can be effectively implemented in these array processors in a massively-parallel fashion by using a matrix-matrix multiply-add operation.
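
    To make the "general framework" concrete, the sketch below solves the APP sequentially with a Floyd-Warshall style triple loop parameterized by the semiring operations. It is not the orbital systolic algorithm of the paper. The function name and argument names are hypothetical, and the sketch assumes a semiring in which the closure of every pivot element equals the multiplicative identity, which holds for shortest paths with non-negative weights and for Boolean transitive closure.

```python
import math

def algebraic_path_closure(a, add, mul, one):
    """Closure of an n x n matrix over a semiring (add, mul).

    Assumption: the closure (star) of each pivot d[k][k] is `one`,
    so no explicit star operation is needed.
    """
    n = len(a)
    d = [row[:] for row in a]
    for i in range(n):
        d[i][i] = add(d[i][i], one)  # include the empty path
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d

def shortest_paths(weights):
    """(min, +) semiring: all-pairs shortest path lengths."""
    return algebraic_path_closure(weights, min, lambda x, y: x + y, 0)

def transitive_closure(adjacency):
    """(or, and) semiring: reachability between every pair of nodes."""
    return algebraic_path_closure(
        adjacency, lambda x, y: x or y, lambda x, y: x and y, True
    )
```

    Swapping the two operations is all that distinguishes the problems; the loop structure, and hence any array mapping of it, stays the same.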

  10. A multi-function parallel processor for binary image processing

    Microsoft Academic Search

    1988-01-01

    The hardware system described in this paper is a high-speed parallel machine called the multi-function parallel processor (MFPP). All 3×3 maskable parallel algorithms for binary image processing, such as smoothing, contouring, erosion, dilation, thinning, and feature extraction, can be implemented on the MFPP. A prototype of MFPP has been designed to work at 5 MHz clock
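
    Two of the 3×3 operations listed above, erosion and dilation, can be sketched in software as follows. This is only an illustration of what a 3×3 neighborhood operation on a binary image computes, not a model of the MFPP hardware; the function names and the choice to treat pixels outside the image as 0 are assumptions.

```python
def _neighborhood(img, y, x):
    """Return the nine values around (y, x); outside pixels count as 0."""
    h, w = len(img), len(img[0])
    return [
        img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
        for yy in (y - 1, y, y + 1)
        for xx in (x - 1, x, x + 1)
    ]

def erode(img):
    """A pixel stays 1 only if its whole 3x3 neighborhood is 1."""
    return [
        [1 if all(_neighborhood(img, y, x)) else 0 for x in range(len(img[0]))]
        for y in range(len(img))
    ]

def dilate(img):
    """A pixel becomes 1 if any pixel in its 3x3 neighborhood is 1."""
    return [
        [1 if any(_neighborhood(img, y, x)) else 0 for x in range(len(img[0]))]
        for y in range(len(img))
    ]
```

    Each output pixel depends only on its own 3×3 window, which is why such operations map naturally onto an array with one processing element per pixel.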

  11. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J. (Seattle, WA)

    1990-01-01

    An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

  12. Multithreaded processor architecture for parallel symbolic computation. Technical report

    SciTech Connect

    Fujita, T.

    1987-09-01

    This paper describes the Multilisp Architecture for Symbolic Applications (MASA), which is a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution. The principal mechanisms exploited for this processor are multiple contexts, interleaved pipeline execution from separate instruction streams, and synchronization based on a bit in each memory cell. The tagged architecture approach is taken for Lisp program execution, and trap conditions are provided for future object manipulation and garbage collection.

  13. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors

    Microsoft Academic Search

    Tao Yang; Apostolos Gerasoulis

    1994-01-01

    We present a low-complexity heuristic, named the dominant sequence clustering algorithm (DSC), for scheduling parallel tasks on an unbounded number of completely connected processors. The performance of DSC is on average, comparable to, or even better than, other higher-complexity algorithms. We assume no task duplication and nonzero communication overhead between processors. Finding the optimum solution for arbitrary directed acyclic task graphs (DAG's) is NP-complete. DSC

  14. Regular processor arrays for matrix algorithms with pivoting

    Microsoft Academic Search

    V. P. Roychowdhury; T. Kailath

    1988-01-01

    It is shown how to obtain regular (though nonsystolic) processor arrays for algorithms with pivoting. First, the fact that pivoting algorithms cannot be systolic is established. Then it is shown how regular iterative algorithms can be formulated for the Gaussian elimination algorithm with partial pivoting and how the algorithm can then be implemented on the so-called regular iterative arrays (locally

  15. A Parallelizing Compiler Cooperative Heterogeneous Multicore Processor Architecture

    Microsoft Academic Search

    Yasutaka Wada; Akihiro Hayashi; Takeshi Masuura; Jun Shirako; Hirofumi Nakano; Hiroaki Shikano; Keiji Kimura; Hironori Kasahara

    Heterogeneous multicore architectures integrating several kinds of accelerator cores in addition to general purpose processor cores have been attracting much attention to realize high performance with low power consumption. To attain effective high performance, high application software productivity, and low power consumption on heterogeneous multicores, cooperation between the architecture and the parallelizing compiler is important. This paper proposes a

  16. The ICAP parallel processor communications switch

    Microsoft Academic Search

    Deepak Rana; Charles C. Weems

    1989-01-01

    The architecture of a custom VLSI parallel communications switch (PARCOS) chip is described. The PARCOS chip consists of a communication matrix of 32-b serial inputs and 32-b serial outputs and an on-chip control memory. The control memory, called the connection pattern cache (CPC), is constructed so that PARCOS can hold up to 32 of the most frequently used connection patterns

  17. Automatic generation of synchronization instructions for parallel processors

    SciTech Connect

    Midkiff, S.P.

    1986-05-01

    The development of high speed parallel multi-processors, capable of parallel execution of doacross and forall loops, has stimulated the development of compilers to transform serial FORTRAN programs to parallel forms. One of the duties of such a compiler must be to place synchronization instructions in the parallel version of the program to ensure the legal execution order of doacross and forall loops. This thesis gives strategies usable by a compiler to generate these synchronization instructions. It presents algorithms for reducing the parallelism in FORTRAN programs to match a target architecture, recovering some of the parallelism so discarded, and reducing the number of synchronization instructions that must be added to a FORTRAN program, as well as basic strategies for placing synchronization instructions. These algorithms are developed for two synchronization instruction sets. 20 refs., 56 figs.

  18. Parallel Array Processors for Digital Image Processing

    Microsoft Academic Search

    M. Tasto

    1977-01-01

    A major drawback of digital computer image processing is the large computation time required. On the other hand, its flexibility, programmability and computational accuracy make it desirable to use digital processing. Advances in technology of LSI circuitry have now made it possible to increase strongly the computational power of image processing systems by combining many ‘micro computers’ or processing elements

  19. Real-time trajectory optimization on parallel processors

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.

    1993-01-01

    A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
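
    The reported speed-up can be put in context with two standard derived figures: parallel efficiency (speed-up divided by processor count) and the Karp-Flatt experimentally determined serial fraction. The small sketch below computes both for the 64-stage Goddard figures quoted above (speed-up 7.2 on 32 nodes). The function names are hypothetical and the interpretation is an illustration, not an analysis from the paper.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors

def karp_flatt_serial_fraction(speedup, processors):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)

# Figures quoted in the abstract: speed-up 7.2 using 32 nodes.
efficiency = parallel_efficiency(7.2, 32)
serial_fraction = karp_flatt_serial_fraction(7.2, 32)
```

    For these figures the efficiency is 0.225 and the serial fraction is about 0.11, meaning roughly a ninth of the single-node run behaves as if it were not parallelized (including communication overhead).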

  20. The Architecture of the Butterfly Plus Parallel Processor Department of Computer Science

    E-print Network

    Kotz, David

    CS-1988-6 The Architecture of the Butterfly Plus Parallel Processor David Kotz Department of the Butterfly Plus Parallel Processor David Kotz December 16, 1987 Abstract This paper investigates the architecture of the Butterfly Plus Parallel Processor, an MIMD shared-memory machine based on the Motorola MC

  1. Automatic array alignment in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    FORTRAN 90 and other data-parallel languages express parallelism in the form of operations on data aggregates such as arrays. Misalignment of the operands of an array operation can reduce program performance on a distributed-memory parallel machine by requiring nonlocal data accesses. Determining array alignments that reduce communication is therefore a key issue in compiling such languages. We present a framework for the automatic determination of array alignments in array-based, data-parallel languages. Our language model handles array sectioning, reductions, spreads, transpositions, and masked operations. We decompose alignment functions into three constituents: axis, stride, and offset. For each of these subproblems, we show how to solve the alignment problem for a basic block of code, possibly containing common subexpressions. Alignments are generated for all array objects in the code, both named program variables and intermediate results. We assign computation to processors by virtue of explicit alignment of all temporaries; the resulting work assignment is in general better than that provided by the 'owner-computes' rule. Finally, we present some ideas for dealing with control flow, replication, and dynamic alignments that depend on loop induction variables.

  2. A taxonomy of reconfiguration techniques for fault-tolerant processor arrays--

    SciTech Connect

    Chean, M. (Shell Development Co., Houston, TX (USA)); Fortes, J.A.B. (Purdue Univ., Lafayette, IN (USA))

    1990-01-01

    The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.

  3. Staging memory for massively parallel processor

    NASA Technical Reports Server (NTRS)

    Batcher, Kenneth E. (Inventor)

    1988-01-01

    The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

  4. Parallel Univariate Real Root Isolation on Multicore Processors

    NASA Astrophysics Data System (ADS)

    Chen, Changbo; Maza, Marc Moreno; Xie, Yuzhen

    2011-11-01

    We present parallel algorithms with optimal cache complexity for the kernel routine of many real root isolation algorithms, namely, Taylor shift, targeting multicore processors. We then report an efficient multithreaded implementation for isolating the real roots of univariate polynomials based on the parallel Taylor shift algorithms. For processing some well-known benchmark examples with sufficiently large size, our software tool reaches linear speedup on an 8-core machine. In addition, we show that our software is able to fully utilize the many cores and the memory space of a 32-core machine to tackle large problems that are out of reach for a desktop implementation.
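
    The Taylor shift kernel named above maps the coefficients of p(x) to those of p(x + c). A minimal sequential sketch using the classical repeated synthetic division (Horner-style) scheme is shown below; it is not the cache-optimal parallel algorithm of the paper, and the function name and low-to-high coefficient order are assumptions.

```python
def taylor_shift(coeffs, c=1):
    """Return the coefficients of p(x + c).

    coeffs: coefficients of p, lowest degree first.
    Uses the classical O(n^2) triangular scheme of additions.
    """
    a = list(coeffs)
    n = len(a)
    for i in range(n - 1):
        # Each pass fixes one more low-order coefficient of the result.
        for j in range(n - 2, i - 1, -1):
            a[j] += c * a[j + 1]
    return a
```

    For example, shifting x^3 by 1 gives the coefficients 1, 3, 3, 1 of (x + 1)^3. The triangular pattern of additions is the part that parallel versions partition across cores.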

  5. A Two's Complement Parallel Array Multiplication Algorithm

    Microsoft Academic Search

    C. R. Baugh; B. A. Wooley

    1973-01-01

    An algorithm for high-speed, two's complement, m-bit by n-bit parallel array multiplication is described. The two's complement multiplication is converted to an equivalent parallel array addition problem in which each partial product bit is the AND of a multiplier bit and a multiplicand bit, and the signs of all the partial product bits are positive.
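
    The idea of turning a signed multiplication into an addition of non-negative partial-product bits can be sketched as follows. This is a hedged illustration using the commonly taught "modified" arrangement for equal operand widths, in which partial-product bits that involve exactly one sign bit are complemented and two constant correction bits are added. It is not claimed to be the exact bit arrangement of the paper, and the function name `signed_array_multiply` is hypothetical.

```python
def signed_array_multiply(a, b, n):
    """Multiply two n-bit two's complement integers by summing only
    non-negative partial-product bits. Returns the signed 2n-bit product."""
    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask              # raw n-bit patterns
    abit = [(ua >> i) & 1 for i in range(n)]
    bbit = [(ub >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = abit[i] & bbit[j]           # AND of one bit from each operand
            # Bits that involve exactly one sign bit carry negative weight;
            # complementing them makes every summand non-negative.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # Constant correction terms that compensate for the complemented bits.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1              # keep 2n result bits

    # Reinterpret the 2n-bit pattern as a signed integer.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

    Because every summand is a plain bit with positive weight, the whole product can be formed by a regular array of adders with no special subtraction cells.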

  6. Parallel Media Processors for the Billion-Transistor Era Jason Fritts, Zhao Wu, and Wayne Wolf

    E-print Network

    Fritts, Jason

    - level languages. Some programmable media processors have started to appear in the marketplace with DSP Parallel Media Processors for the Billion-Transistor Era Jason Fritts, Zhao Wu, and Wayne Wolf Dept}@ee.princeton.edu Abstract This paper describes the challenges presented by single- chip parallel media processors (PMPs

  7. VLSI array processor R&D status report

    NASA Astrophysics Data System (ADS)

    Greenwood, E.

    1982-01-01

    Detail design of the Arithmetic Processor Unit (APU) chip has been completed. All cell types (100) have been run through the design rule check (DRC) programs, corrected and verified. DRC runs on the entire chip have been completed and all corrections have been made. Fifteen of the eighteen chip DRC corrections have been verified. The metal, polysilicon and information data layers of the APU layout are shown. The attached drawing, titled 'VLSI Array Processor Arithmetic Processor Unit Chip Plan', is a detailed drawing of the APU Chip Plan. The functional level simulator of the APU has been built and verified using a set of APU diagnostic code. A gate level logic simulation of the APU has been built. The APU breadboard modules have been fabricated and checkout has been initiated. The Array Processor Demonstration System (APDS) modules are in the wire-wrap process. The APDS and APU microcode assembler have been built and checked out. The linker and loader for the APDS have also been built.

  8. Analog parallel processor hardware for high speed pattern recognition

    NASA Technical Reports Server (NTRS)

    Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

    1990-01-01

    A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 microseconds. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.
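
    The matching scheme in this abstract (one inner product per stored pattern, then a winner-take-all selection) can be mimicked in software. The sketch below is a digital stand-in for the analog hardware, under the assumption that patterns are equal-length numeric vectors; the name `best_match` and the sample data are illustrative only.

```python
def best_match(library, x):
    """Return (index of best-matching stored pattern, list of all scores)."""
    # One inner product per stored pattern; the hardware forms these in parallel.
    scores = [sum(p_k * x_k for p_k, x_k in zip(pattern, x))
              for pattern in library]
    # Winner-take-all: keep only the index of the largest score
    # (the first such index if several scores tie).
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


library = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
winner, scores = best_match(library, [1, 0, 1, 1])
```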

  9. Parallel information transfer in a multinode quantum information processor.

    PubMed

    Borneman, T W; Granade, C E; Cory, D G

    2012-04-01

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel. PMID:22540778

  10. Semantic network array processor and its applications to image understanding

    SciTech Connect

    Dixit, V.; Moldovan, D.I.

    1987-01-01

    The problems in computer vision range from edge detection and segmentation at the lowest level to the problem of cognition at the highest level. This correspondence describes the organization and operation of a semantic network array processor (SNAP) as applicable to high level computer vision problems. The architecture consists of an array of identical cells each containing a content addressable memory, microprogram control, and a communication unit. The applications discussed in this paper are the two general techniques, discrete relaxation and dynamic programming. While the discrete relaxation is discussed with reference to scene labeling and edge interpretation, the dynamic programming is tuned for stereo.

  11. Processor Efficient Parallel Solution of Linear Systems over an Abstract Field*

    E-print Network

    Kaltofen, Erich

    higher. 1. Introduction A processor efficient parallel algorithm is a parallel algorithm that has. An individual step in our algorithms is an addition, subtraction, multiplication, division, or zero present processor efficient randomized parallel algorithms for solving non-singular systems

  12. Particle simulation of plasmas on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Gledhill, I. M. A.; Storey, L. R. O.

    1987-01-01

    Particle simulations, in which collective phenomena in plasmas are studied by following the self-consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two-dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.
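
    The Eulerian storage scheme described here, in which particle records migrate between cell-local memories as particles cross the grid, can be illustrated with a small sequential sketch. It assumes free-streaming particles (no field solve) on a periodic grid with unit cell size. The dictionary-of-cells layout and the name `advance` are hypothetical and do not reflect the actual processor-array memory layout.

```python
def advance(cells, nx, ny, dt):
    """Move every particle and re-file it under the cell it now occupies.

    cells maps (ix, iy) -> list of (x, y, vx, vy) records, mimicking one
    local memory per grid cell.
    """
    new_cells = {}
    for particles in cells.values():
        for x, y, vx, vy in particles:
            # Periodic boundary conditions in both directions.
            x = (x + vx * dt) % nx
            y = (y + vy * dt) % ny
            # Guard against the rare rounding case where x % nx == nx.
            key = (int(x) % nx, int(y) % ny)
            new_cells.setdefault(key, []).append((x, y, vx, vy))
    return new_cells


# A particle near the right edge of a 128 x 128 grid wraps around to cell (0, 0).
cells = {(127, 0): [(127.5, 0.5, 1.0, 0.0)]}
cells = advance(cells, 128, 128, 1.0)
```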

  13. Prototype Focal-Plane-Array Optoelectronic Image Processor

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey

    1995-01-01

    Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.

  14. An Architecture for Large ModSAF Simulations Using Scalable Parallel Processors

    Microsoft Academic Search

    Sharon Brunett; Thomas Gottschalk

    1997-01-01

    An implementation of ModSAF for Scalable Parallel Processors (SPPs) is presented. This model exploits the large number of processing elements and fast interprocessor communications of SPPs to simulate many thousands of vehicles on a single SPP. The implementation uses a heterogeneous assignment of tasks to processors, with most processors running independent copies of the standard SAFSim code and additional

  15. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)]

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

  16. Digital signal processor and programming system for parallel signal processing

    SciTech Connect

    Van den Bout, D.E.

    1987-01-01

    This thesis describes an integrated assault upon the problem of designing high-throughput, low-cost digital signal-processing systems. The dual prongs of this assault consist of: (1) the design of a digital signal processor (DSP) which efficiently executes signal-processing algorithms in either a uniprocessor or multiprocessor configuration, (2) the PaLS programming system which accepts an arbitrary algorithm, partitions it across a group of DSPs, synthesizes an optimal communication link topology for the DSPs, and schedules the partitioned algorithm upon the DSPs. The results of applying a new quasi-dynamic analysis technique to a set of high-level signal-processing algorithms were used to determine the uniprocessor features of the DSP design. For multiprocessing applications, the DSP contains an interprocessor communications port (IPC) which supports simple, flexible, dataflow communications while allowing the total communication bandwidth to be incrementally allocated to achieve the best link utilization. The net result is a DSP with a simple architecture that is easy to program for both uniprocessor and multi-processor modes of operation. The PaLS programming system simplifies the task of parallelizing an algorithm for execution upon a multiprocessor built with the DSP.

  17. Implementation of SAR interferometric map generation using parallel processors

    SciTech Connect

    Doren, N.; Wahl, D.E.

    1998-07-01

    Interferometric fringe maps are generated by accurately registering a pair of complex SAR images of the same scene imaged from two very similar geometries, and calculating the phase difference between the two images by averaging over a neighborhood of pixels at each spatial location. The phase difference (fringe) map resulting from this IFSAR operation is then unwrapped and used to calculate the height estimate of the imaged terrain. Although the method used to calculate interferometric fringe maps is well known, it is generally executed in a post-processing mode well after the image pairs have been collected. In that mode of operation, there is little concern about algorithm speed and the method is normally implemented on a single processor machine. This paper describes how the interferometric map generation is implemented on a distributed-memory parallel processing machine. This particular implementation is designed to operate on a 16 node Power-PC platform and to generate interferometric maps in near real-time. The implementation is able to accommodate large translational offsets, along with a slight amount of rotation which may exist between the interferometric pair of images. If the number of pixels in the IFSAR image is large enough, the implementation accomplishes nearly linear speed-up times with the addition of processors.
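
    The phase-difference step described in this abstract can be sketched as the phase of a neighborhood sum of one image multiplied by the complex conjugate of the other. The sketch assumes the two complex images are already registered and equally sized, and it is a plain sequential illustration, not the 16-node implementation of the paper. The name `fringe_map` and the window parameter `half` are assumptions.

```python
import cmath


def fringe_map(img1, img2, half=1):
    """Wrapped phase difference of two registered complex images.

    Each output pixel is the phase of the sum of img1 * conj(img2) over a
    (2*half + 1)-wide square window, clipped at the image border.
    """
    rows, cols = len(img1), len(img1[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0j
            for rr in range(max(0, r - half), min(rows, r + half + 1)):
                for cc in range(max(0, c - half), min(cols, c + half + 1)):
                    acc += img1[rr][cc] * img2[rr][cc].conjugate()
            out[r][c] = cmath.phase(acc)   # wrapped to (-pi, pi]
    return out


# Two 3 x 3 images that differ by a constant phase of 0.5 rad.
img_a = [[cmath.exp(0.5j)] * 3 for _ in range(3)]
img_b = [[1 + 0j] * 3 for _ in range(3)]
fringes = fringe_map(img_a, img_b)
```

    Each output pixel depends only on a small window of input pixels, which is what makes the computation easy to split across processors by image tiles.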

  18. Improving system performance in contiguous processor allocation for mesh-connected parallel systems

    Microsoft Academic Search

    Kyung-hee Seo; Sung-chun Kim

    2003-01-01

    Fragmentation is the main performance bottleneck of large, multiuser parallel computer systems. Current contiguous processor allocation techniques for mesh-connected parallel systems are restricted to rectangular submesh allocation strategies causing significant fragmentation problems. This paper presents an L-shaped submesh allocation (LSSA) strategy, which lifts the restriction on the rectangular shape formed by allocated processors in order to address the problem of

  19. Fast and processor-efficient parallel algorithms for reducible-flow graphs. Technical report

    Microsoft Academic Search

    Ramachandran

    1988-01-01

    This document presents parallel NC algorithms for recognizing reducible flow graphs (rfgs), and for finding dominators, minimum feedback vertex sets, and a depth-first search numbering in an rfg. All of these algorithms run in polylog parallel time using M(n) processors, where M(n) is the number of processors needed to multiply two n x n matrices in polylog time; this is

  20. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

    Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

  1. An FPGA based Phased Array Processor for the Sub-Millimeter Array

    E-print Network

    Nagpal, Vinayak

    2012-01-01

    It has been widely acknowledged that Very Long Baseline Interferometry (VLBI) in the submillimeter wavelengths can make imaging observations of super massive black holes possible. The Sub-Millimeter Array (SMA) along with the James Clerk Maxwell Telescope (JCMT) and Caltech Submillimeter Observatory (CSO) on the Mauna Kea summit in Hawaii can together provide a large collecting area as one or more stations for VLBI observations aimed at studying an event horizon. To work as a VLBI station with full collecting area the SMA (or a combination SMA, JCMT, CSO antennas) would need a processor to enable phased array operation. This masters project focusses on building such a processor. Back end processing for high bandwidth radio telescopes has traditionally been done using custom designed application specific integrated circuits (ASIC). Recent advances in Field Programmable Gate Array (FPGA) technology have made FPGAs both powerful and economically viable for radio astronomy back ends. We have attempted to take adv...

  2. Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

    1997-01-01

    A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

  3. A systolic parallel processor for the rapid computation of multiresolution edge images using the ∇²G operator

    SciTech Connect

    Clark, J.J.; Lawrence, P.D.

    1985-08-01

    This paper describes the application of current parallel processor technology to an important problem in computer vision: the computation of a multiresolution set of edge images from a video camera signal. The edge operator of choice for many image analysis systems is the ∇²G, or Laplacian of a Gaussian, operator. This operator locates edges by finding the zero crossings of the ∇²G-filtered image. A hardware system for the production of a zero crossing "pyramid" is proposed. The hardware processor utilizes a set of "systolic" array processors which implement two-dimensional digital lowpass and bandpass filters as well as the zero crossing detectors. A multilevel interleaved system is described which allows concurrent processing of two sets of image descriptions and ensures that the component processing elements are utilized to the fullest.

  4. On nonlinear finite element analysis in single-, multi- and parallel-processors

    NASA Technical Reports Server (NTRS)

    Utku, S.; Melosh, R.; Islam, M.; Salama, M.

    1982-01-01

    Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
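
    The observation that each Newton-Raphson step reduces to a linear problem can be made concrete with a toy structure. The sketch below assumes a two-degree-of-freedom spring chain (a hardening spring to ground and a linear spring between the nodes) loaded at the free end. The model, the parameter values and the function names are illustrative assumptions, not taken from the paper.

```python
def solve2(K, r):
    """Solve the 2 x 2 linear system K * du = r by Cramer's rule."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return [(r[0] * K[1][1] - K[0][1] * r[1]) / det,
            (K[0][0] * r[1] - K[1][0] * r[0]) / det]


def newton_equilibrium(P, k1=1.0, c=1.0, k2=2.0, tol=1e-12, max_iter=25):
    """Find displacements u with internal force equal to the load [0, P]."""
    u = [0.0, 0.0]
    for _ in range(max_iter):
        # Internal forces: hardening spring at node 0, linear spring 0-1.
        f_int = [k1 * u[0] + c * u[0] ** 3 - k2 * (u[1] - u[0]),
                 k2 * (u[1] - u[0])]
        residual = [0.0 - f_int[0], P - f_int[1]]
        if max(abs(residual[0]), abs(residual[1])) < tol:
            break
        # Tangent stiffness at the current state; the step is a linear solve.
        K_t = [[k1 + 3.0 * c * u[0] ** 2 + k2, -k2],
               [-k2, k2]]
        du = solve2(K_t, residual)
        u = [u[0] + du[0], u[1] + du[1]]
    return u


u = newton_equilibrium(P=2.0)   # exact equilibrium is u = [1, 2]
```

    In a real finite element code the 2 x 2 solve would be replaced by a factorization of the assembled tangent stiffness matrix, which is the part that the abstract proposes to distribute over substructures.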

  5. Tradeoff between data-, instruction-, and thread-level parallelism in stream processors

    Microsoft Academic Search

    Jung Ho Ahn; Mattan Erez; William J. Dally

    2007-01-01

    This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware costs, of mechanisms that exploit the different types of parallelism. We show that the hardware overhead of supporting

  6. Parallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR Compiler

    E-print Network

    Kasahara, Hironori

    -efficient. These requirements can be realized by integrated control systems with enhanced electric control units, or real-generation automobiles integrated control system. In terms of multi-core processors in the automotive controlParallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR

  7. Constant time algorithms for some geometric intersection problems on processor arrays with reconfigurable bus systems

    E-print Network

    Pathikonda, Chakrapani

    1991-01-01

    I, a 4 x 6 PARBS is depicted. Fig. 1. A 4 x 6 processor array with a reconfigurable bus system. For example, by connecting port L to port R within each processor, as in Fig. 2, horizontally straight buses can be established...) and HIGH ID (the highest processor ID with b_i=1). Initially, each processor P_i of the linear PARBS has a binary value b_i. At the end of the sub-algorithm, all processors know the LOW ID and the HIGH ID. 2. Sub-Algorithm 5.1 Step 0: /* LOW ID */. Each...

  8. Parallel H-Tree Based Data Cubing on Graphics Processors Baoyuan Wang

    E-print Network

    Yu, Yizhou

    Parallel H-Tree Based Data Cubing on Graphics Processors Baoyuan Wang Yizhou Yu Abstract Graphics@gmail.com Yizhou Yu is with Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL

  9. Runtime support for integrating precomputation and thread-level parallelism on simultaneous multithreaded processors

    Microsoft Academic Search

    Tanping Wang; Filip Blagojevic; Dimitrios S. Nikolopoulos

    2004-01-01

    This paper presents runtime mechanisms that enable flexible use of speculative precomputation in conjunction with thread-level parallelism on SMT processors. The mechanisms were implemented and evaluated on a real multi-SMT system. So far, speculative precomputation and thread-level parallelism have been used disjunctively on SMT processors and no attempts have been made to compare and possibly combine these techniques for further

  10. Application of the hypercube parallel processor to a large-scale moment method code

    NASA Technical Reports Server (NTRS)

    Manshadi, Farzin; Liewer, Paulet C.; Patterson, Jean E.

    1988-01-01

    The applicability of a parallel computing architecture to the solution of a large-scale moment-method code is investigated. Specifically, the NEC (Numerical Electromagnetics Code) method-of-moments scattering program is implemented on a hypercube parallel processor. The accuracy and the increase in the speed of execution on this parallel architecture are demonstrated. The results show a very large reduction in execution time for large problems. The great potential of this parallel processor is shown for interactive solution of large NEC problems as well as other moment-method techniques such as the finite-element method.

  11. A Compact FPGA Implementation of a Bit-Serial SIMD Cellular Processor Array

    E-print Network

    Dudek, Piotr

    A Compact FPGA Implementation of a Bit-Serial SIMD Cellular Processor Array Declan Walsh and Piotr Kingdom declan.walsh@postgrad.manchester.ac.uk; p.dudek@manchester.ac.uk Abstract-- An FPGA implementation to form an array. A 32 × 32 processing element array is implemented on a low-cost Xilinx XC5VLX50 FPGA

  12. Parallel signal processing

    NASA Astrophysics Data System (ADS)

    McWhirter, John G.

    1989-12-01

    The potential application of parallel computing techniques to digital signal processing for radar is discussed and two types of regular array processor are discussed. The first type of processor is the systolic or wavefront processor. The application of this type of processor to adaptive beamforming is discussed and the joint STL-RSRE adaptive antenna processor test-bed is reviewed. The second type of regular array processor is the SIMD parallel computer. One such processor, the Mil-DAP, is described, and its application to a varied range of radar signal processing tasks is discussed.

  13. Optimal Frequency Band Design Scheme of Dyadic Wavelet Processor Array Using Surface Acoustic Wave Devices

    Microsoft Academic Search

    Changbao Wen; Changchun Zhu; Yongfeng Ju; Yanzhang Qiu; Hongke Xu; Wenke Lu

    2009-01-01

    In this paper, the relationship between the center frequency and radius of bandwidth and its effect on the frequency band characteristics of dyadic wavelet processor array using surface acoustic wave (SAW) devices are studied, and an optimal frequency band design scheme is proposed. For an arbitrary scale wavelet processor, we proposed that the center frequency is defined to three times

  14. Embedded processor for array of hydrophone sensors to construct real time images for AUV using FPGA

    Microsoft Academic Search

    Muataz H. Salih; M. R. Arshad

    2009-01-01

    Implementation of embedded systems-on-chip on modern field programmable gate array (FPGA) chips is feasible owing to their high density. An architecture for multilevel computing, focusing on its embedded processor, is proposed in our project. The architecture design of the embedded processor presents the challenges and opportunities that stem from the task coarse granularity and the large number of input and output for

  15. Approximating Euclidean Distance Transform with Simple Operations in Cellular Processor Arrays

    E-print Network

    Dudek, Piotr

    Approximating Euclidean Distance Transform with Simple Operations in Cellular Processor Arrays good approximation to Euclidean distances, operating with `increment' and `minimum' operations only to Euclidean, City Block, Chessboard and Chamfer distance transforms. I. INTRODUCTION In digital image
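
    To show how a distance transform can be built from only `increment' and `minimum' operations, the sketch below computes the simpler City Block transform by repeated nearest-neighbour updates, in the spirit of a cellular array where every cell applies the same local rule. It is not the Euclidean approximation proposed in the paper, and the name `city_block_dt` is hypothetical.

```python
def city_block_dt(mask):
    """City Block distance of every cell to the nearest nonzero cell of mask."""
    rows, cols = len(mask), len(mask[0])
    inf = rows + cols                      # exceeds any distance on the grid
    d = [[0 if mask[r][c] else inf for c in range(cols)] for r in range(rows)]
    changed = True
    while changed:
        changed = False
        new = [row[:] for row in d]        # all cells update simultaneously
        for r in range(rows):
            for c in range(cols):
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        cand = d[rr][cc] + 1       # increment
                        if cand < new[r][c]:       # minimum
                            new[r][c] = cand
                            changed = True
        d = new
    return d


dist = city_block_dt([[0, 0, 0],
                      [0, 1, 0],
                      [0, 0, 0]])
```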

  16. On fault-tolerant structure, distributed fault-diagnosis, reconfiguration, and recovery of the array processors

    SciTech Connect

    Hosseini, S.H.

    1989-07-01

    The increasing need for the design of high-performance computers has led to the design of special purpose computers such as array processors. This paper studies the design of fault-tolerant array processors. First, it is shown how hardware redundancy can be employed in the existing structures in order to make them capable of withstanding the failure of some of the array links and processors. Then distributed fault-tolerance schemes are introduced for the diagnosis of the faulty elements, reconfiguration, and recovery of the array. Fault tolerance is maintained by the cooperation of processors in a decentralized form of control without the participation of any type of hardcore or fault-free central controller such as a host computer.

  17. Array processor featuring an effective FIFO-based data stream management

    Microsoft Academic Search

    Toshiaki Miyazaki; Yuusuke Nomoto; Yuka Sato; Stanislav G. Sedukhin

    2008-01-01

    In array processors, data I/O management is the key to realizing high-speed matrix operations that are often required in signal and image processing. In this paper, we propose an array processor utilizing an effective data I/O mechanism featuring external FIFOs. The FIFOs are used to buffer initial matrix data and partially processed results. Therefore, if all required data are stored

  18. Biologically-Inspired Massively-Parallel Architectures - Computing Beyond a Million Processors

    Microsoft Academic Search

    Stephen B. Furber; Andrew D. Brown

    2009-01-01

    The SpiNNaker project aims to develop parallel computer systems with more than a million embedded processors. The goal of the project is to support large-scale simulations of systems of spiking neurons in biological real time, an application that is highly parallel but also places very high loads on the communication infrastructure due to the very high connectivity of biological neurons.

  19. Mathematical and numerical models to achieve high speed with special-purpose parallel processors

    SciTech Connect

    Cheng, H.S.; Wulff, W.; Mallen, A.N.

    1986-07-01

    One simulation facility that has been developed is the BNL Plant Analyzer, currently set up for BWR plant simulations at up to seven times faster than real-time process speeds. The principal hardware components of the BNL Plant Analyzer are two units of special-purpose parallel processors, the AD10 of Applied Dynamics International and a PDP-11/34 host computer. The AD10 is specifically designed for time-critical system simulations, utilizing the modern parallel processing technology with pipeline architecture. The simulator employs advanced modeling techniques and efficient integration techniques in conjunction with the parallel processors to achieve high speed performance.

  20. Preliminary study on the potential usefulness of array processor techniques for structural synthesis

    NASA Technical Reports Server (NTRS)

    Feeser, L. J.

    1980-01-01

    The effects of the use of array processor techniques within the structural analyzer program, SPAR, are simulated in order to evaluate the potential analysis speedups which may result. In particular, the connection of a Floating Point System AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data, estimates are made as to the relative speedups that can be executed in a more complete implementation on an array processor maxi-mini computer system.

  1. Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP

    Microsoft Academic Search

    Toshihiro Hanawa; Mitsuhisa Sato; Jinpil Lee; Takayuki Imada; Hideaki Kimura; Taisuke Boku

    2009-01-01

    Recently, multicore technology has been introduced to embedded systems in order to improve performance and reduce power consumption. In the present study, three SMP multicore processors for embedded systems and a multicore processor for a desktop PC are evaluated by the parallel benchmark using OpenMP. The results indicate that, even if the memory performance is low, applications that are not

  2. An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors

    Microsoft Academic Search

    Daisuke Takahashi

    2006-01-01

    In the present paper, an implementation of a parallel one-dimensional fast Fourier transform (FFT) using Streaming SIMD Extensions 3 (SSE3) instructions on dual-core processors is proposed. Combination of vectorization and the block six-step FFT algorithm is shown to effectively improve performance. The performance results for one-dimensional FFTs on dual-core Intel Xeon processors are reported. We successfully achieved performance of approximately

  3. High-speed Systolic Array Processor (HISSAP) system development synopsis: Lesson learned. Final report, Oct 83-Oct 90

    SciTech Connect

    Loughlin, J.P.

    1991-05-01

    This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic and software mapping tools is described. Based on the experience gained in the mapping of matrix-based algorithms on the testbed hardware, specific recommendations are presented in the form of lessons learned, which are intended to offer guidance in the development of future Navy signal processing systems.

  4. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (inventor); Bejczy, Antal K. (inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  5. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Gibson, Garth Alan

    1990-01-01

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems but, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
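
    The parity idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single stripe of equally sized blocks with one XOR parity block; the function names (`xor_blocks`, `make_stripe`, `rebuild`) are hypothetical and are not taken from the thesis. Because the failed disk identifies itself, XOR-ing the surviving blocks is enough to recover the missing one.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Append one parity block to the data blocks (illustrative layout only)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, failed_index):
    """Recover the block at failed_index from all surviving blocks.

    The failure is assumed to be self-identifying: we know which index is lost.
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
recovered = rebuild(stripe, 1)  # pretend the second disk failed
```

    The same `rebuild` call also regenerates the parity block itself if that is the block that was lost, since XOR treats data and parity symmetrically.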

  6. Using algebra for massively parallel processor design and utilization

    NASA Technical Reports Server (NTRS)

    Campbell, Lowell; Fellows, Michael R.

    1990-01-01

    This paper summarizes the authors' advances in the design of dense processor networks. It reports a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

  7. Implementation and evaluation of FAST corner detection on the massively parallel embedded processor MX-G

    Microsoft Academic Search

    Yushi Moko; Takashi Komuro; Masami Nakajima; Yoshihiro Watanabe; Masatoshi Ishikawa; Kazutami Arimoto

    2011-01-01

    We implemented and evaluated the FAST corner detection algorithm on the MX-G, a system LSI device with a matrix-type massively parallel processor "MX core" developed by Renesas Electronics Corp. FAST corner detection is a very efficient feature detection algorithm. We developed a method to parallelize the FAST algorithm by using both the MX core and the SH-2A host CPU effectively.
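
    For readers unfamiliar with FAST, the per-pixel segment test can be sketched as follows. This is a plain sequential illustration, not the MX-G implementation described in the record; the threshold of 20 and the run length of 9 are assumed values, and `is_fast_corner` is a hypothetical name. Each pixel is tested independently of the others, which is the property that makes the algorithm a natural fit for a massively parallel array.

```python
# Offsets (dx, dy) of the 16 pixels on a radius-3 Bresenham circle, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test: True if n contiguous circle pixels are all brighter than
    center + threshold or all darker than center - threshold.

    img is a list of rows of grey values; (x, y) must lie at least 3 pixels
    from every border. threshold and n are assumed defaults.
    """
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):  # +1 checks "brighter", -1 checks "darker"
        flags = [sign * (p - center) > threshold for p in ring]
        run = 0
        # Extend the ring by n - 1 entries so runs crossing the wrap-around count.
        for flag in flags + flags[:n - 1]:
            run = run + 1 if flag else 0
            if run >= n:
                return True
    return False


# A bright isolated dot on a dark background passes the test at its center.
dot = [[0] * 7 for _ in range(7)]
dot[3][3] = 100
dot_is_corner = is_fast_corner(dot, 3, 3)
```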

  8. Distributed Array Management Scheme for Data-Parallel Compilers

    E-print Network

    Paris-Sud XI, Université de

    Distributed Array Management Scheme for Data-Parallel Compilers Yves Maheo, Jean-Louis Pazat IRISA parallel programs for distributed memory machines. As data distribution is a key-feature for exploiting distributed arrays. We present in this paper an innovative method to allocate local blocks and temporaries

  9. A Reliable Processor-Allocation Strategy for Mesh-Connected Parallel Systems

    Microsoft Academic Search

    Kyung-hee Seo; Sung-chun Kim

    2001-01-01

    Efficient utilization of processing resources in a large, multi-user parallel computer system depends on reliable processor allocation algorithms. The paper presents an LSSA (L-shaped submesh allocation) strategy to reduce external fragmentation and job response time simultaneously. LSSA manipulates the shape of the required submesh to fit into the fragmented mesh system and accommodates incoming jobs faster than other strategies.

  10. Multiresolution spatially constrained clustering of remotely sensed data on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Tilton, J. C.

    1984-01-01

    A multiresolution spatially-constrained clustering (MSCC) algorithm implemented on the Massively Parallel Processor is discussed. The MSCC algorithm uses a coarse-to-fine resolution schema and region switching and region splitting algorithms. Applications include production of segments of remotely sensed imagery and use as a front end for a classification of a cluster compression algorithm.

  11. Efficient Algorithms for Parallel Excitation and Parallel Imaging with Large Arrays

    E-print Network

    Feng, Shuo

    2013-08-12

    RF signals using phased arrays are called parallel excitation (pTx) and parallel imaging (PI), respectively. These two techniques lead to shorter transmit pulses for higher imaging quality and faster data acquisition correspondingly...

  12. Impact of shipping Ball-Grid-Array Notebook processors in tape and reel on the PC supply chain

    E-print Network

    Chuang, Pamela

    2012-01-01

    Today, approximately 90% of Intel notebook processors are packaged in PGA (Pin Grid Array) and 10% are packaged in BGA (Ball Grid Array). Intel has recently made a decision to transform the notebook industry by creating a ...

  13. Series-parallel method of direct solar array regulation

    NASA Technical Reports Server (NTRS)

    Gooder, S. T.

    1976-01-01

    A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within ±0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.

  14. PostProcessor Development for a Six Degree-of-Freedom Parallel-Link Machine Tool

    Microsoft Academic Search

    S.-L. Chen; Y.-C. Liu

    2001-01-01

    A coordinate system with multiple non-orthogonal axes, defined for a parallel-link machine tool, is very different from that for a conventional serial-type machine. Therefore, a special post-processor that can automatically transfer the cutter location data (CL-data) into machine specific NC commands is essential for real machining applications of a parallel-link machine tool. Parallel-link machine tools have been investigated by many

  15. Ferroelectric/Optoelectronic Memory/Processor

    NASA Technical Reports Server (NTRS)

    Thakoor, Sarita; Thakoor, Anilkumar P.

    1992-01-01

    Proposed hybrid optoelectronic nonvolatile analog memory and data processor comprises planar array of microscopic photosensitive ferroelectric capacitors performing massively parallel analog computations. Processors overcome electronic crosstalk and limitations on number of input/output contacts inherent in electronic implementations of large interconnection arrays. Used in general optical computing, recognition of patterns, and artificial neural networks.

  16. Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition

    DOEpatents

    Tomkins, James L. (Albuquerque, NM); Camp, William J. (Albuquerque, NM)

    2007-07-17

    A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

  17. Optimal piecewise linear schedules for LSGP- and LPGS-decomposed array processors via quadratic programming

    NASA Astrophysics Data System (ADS)

    Zimmermann, Karl-Heinz; Achtziger, Wolfgang

    2001-09-01

    The size of a systolic array synthesized from a uniform recurrence equation, whose computations are mapped by a linear function to the processors, matches the problem size. In practice, however, there exist several limiting factors on the array size. There are two dual schemes available to derive arrays of smaller size from large-size systolic arrays based on the partitioning of the large-size arrays into subarrays. In LSGP, the subarrays are clustered one-to-one into the processors of a small-size array, while in LPGS, the subarrays are serially assigned to a reduced-size array. In this paper, we propose a common methodology for both LSGP and LPGS based on polyhedral partitionings of large-size k-dimensional systolic arrays which are synthesized from n-dimensional uniform recurrences by linear mappings for allocation and timing. In particular, we address the optimization problem of finding optimal piecewise linear timing functions for small-size arrays. These are mappings composed of linear timing functions for the computations of the subarrays. We study a continuous approximation of this problem by passing from piecewise linear to piecewise quasi-linear timing functions. The resultant problem formulation is then a quadratic programming problem which can be solved by standard algorithms for nonlinear optimization problems.
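
    The difference between the two partitioning schemes named in this abstract can be sketched for the simplest case. The following is a one-dimensional illustration under assumptions of our own: it ignores the timing functions and polyhedral partitionings that the paper actually studies, and the helper names `lsgp` and `lpgs` are hypothetical.

```python
def lsgp(cell, n_cells, n_procs):
    """LSGP (local sequential, global parallel), 1-D illustration.

    The large array is cut into n_procs contiguous blocks. Each block is
    assigned to one processor, which works through its cells one after another.
    Returns (processor, local_step).
    """
    block = -(-n_cells // n_procs)  # ceiling division: cells per block
    return cell // block, cell % block


def lpgs(cell, n_procs):
    """LPGS (local parallel, global sequential), 1-D illustration.

    The large array is cut into tiles of n_procs cells. The cells of one tile
    run in parallel on the small array; the tiles are processed in sequence.
    Returns (processor, tile_step).
    """
    return cell % n_procs, cell // n_procs


# Map an 8-cell array onto 4 processors under each scheme.
lsgp_map = [lsgp(c, 8, 4) for c in range(8)]
lpgs_map = [lpgs(c, 4) for c in range(8)]
```

    In this toy setting both schemes need two steps per processor; they differ in which cells share a processor, and therefore in the local memory and communication each processor requires.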

  18. A processing element architecture for high-density focal plane analog programmable array processors

    Microsoft Academic Search

    Gustavo Liñan Cembrano; Servando Espejo-meana; Rafael Domínguez-castro; Ángel Rodríguez-vázquez

    2002-01-01

    The architecture of the elementary Processing Element - PE - used in a recently designed 128×128 Focal Plane Analog Programmable Array Processor is presented. The PE architecture contains the required building blocks to implement bifurcated data flow vision algorithms based on the execution of 3×3 convolution masks. The vision chip has been implemented in a standard 0.35 µm CMOS technology.

  19. A unified approach to VLSI layout automation and algorithm mapping on processor arrays

    NASA Technical Reports Server (NTRS)

    Venkateswaran, N.; Pattabiraman, S.; Srinivasan, Vinoo N.

    1993-01-01

    Development of software tools for designing supercomputing systems is highly complex and cost ineffective. To tackle this, a special-purpose PAcube silicon compiler, which integrates different design levels from cell to processor arrays, has been proposed. As a part of this, we present in this paper a novel methodology which unifies the problems of Layout Automation and Algorithm Mapping.

  20. Coupled cluster algorithms for networks of shared memory parallel processors

    NASA Astrophysics Data System (ADS)

    Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.

    2007-05-01

    As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too increases the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] (General Atomic and Molecular Electronic Structure System) program suite and the Distributed Data Interface [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190] (DDI); however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.

  1. An analogue SIMD focal-plane processor array

    Microsoft Academic Search

    Piotr Dudek; Peter J. Hicks

    2001-01-01

    A new smart-sensor VLSI circuit intended for focal-plane processing of grey-scale images is presented. The architecture is based on a fine-grain software-programmable SIMD array. Processing elements, integrated within each pixel of the imager, are implemented utilising a switched-current analogue microprocessor concept. In a 0.6 µm CMOS process the cell size is equal to 98.6 µm × 98.6 µm. A prototype 21×21 array chip executes over

  2. Q-plates micro-arrays for parallel processing of the photon orbital angular momentum

    NASA Astrophysics Data System (ADS)

    Loussert, Charles; Kushnir, Kateryna; Brasselet, Etienne

    2014-09-01

    We report on the realization of electrically tunable micro-arrays of space-variant optically anisotropic optical vortex generators. Each individual light orbital angular momentum processor consists of a microscopic self-engineered nematic liquid crystal q-plate made of a nonsingular topological defect spontaneously formed under electric field. Both structural and optical characterizations of the obtained spin-orbit optical interface are analyzed. An analytical model is derived and results of simulations are compared with experimental data. The application potential in terms of parallel processing of the optical orbital angular momentum is quantitatively discussed.

  3. Parallelized reconstruction of spiral Fourier velocity encoding MRI data on multicore processors

    E-print Network

    Carvalho, João Luiz

    -core processors 2D (x,y), 4D (x,y,v,t), and 5D (x,y,z,v,t) data Parfor Loops Parfor restrictions changes be reduced using spiral trajectories in kx-ky (spatial encoding) [1] Spiral FVE: long reconstruction time FVE k-space trajectory: a stack-of-spirals in kx-ky-kv [1] Parallelized Reconstruction in Matlab Use

  4. Construction of a parallel processor for simulating manipulators and other mechanical systems

    NASA Technical Reports Server (NTRS)

    Hannauer, George

    1991-01-01

    This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

  5. The Design and Implementation of the Massively Parallel Processor Based on the Matrix Architecture

    Microsoft Academic Search

    Hideyuki Noda; Masami Nakajima; Katsumi Dosaka; Kiyoshi Nakata; Motoki Higashida; Osamu Yamamoto; Katsuya Mizumoto; Tetsushi Tanizaki; Takayuki Gyohten; Yoshihiro Okuno; Hiroyuki Kondo; Yukihiko Shimazu; Kazutami Arimoto; Kazunori Saito; Toru Shimizu

    2007-01-01

    This paper describes the design and implementation of the massively parallel processor based on the matrix architecture which is suitable for portable multimedia applications. The proposed architecture in this paper achieves the high performance of 40 GOPS in the case of consecutive fixed-point 16-bit additions at 200MHz clock frequency and the small power dissipation of 250mW. In addition, 1Mbit SRAM

  6. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    SciTech Connect

    Cancelo, G. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Gottschalk, Erik Edward [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Pavlicek, V. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wang, M. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wu, J. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  7. An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

    SciTech Connect

    Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.; Catalyurek, Umit V.; Kurc, Tahsin; Sadayappan, Ponnuswamy; Saltz, Joel H.

    2009-08-01

    Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application-tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks and the inter-task data communication volumes. A locality-conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.

  8. Smart pixel memory buffer array with parallel and serial access

    Microsoft Academic Search

    K. K. Chau; M. W. Derstine; S. Wakelin; J. Cloonan; F. Kiamilev; A. Krishnamoorthy; K. W. Goossen

    1996-01-01

    In summary, an optical memory smart pixel array has been designed and fabricated. The functions and performance of this device have been tested and confirmed. We demonstrated a smart pixel capable of storing 4 pages of 32-bit memory with parallel operation of the smart pixel array at clock rates up to 277 MHz.

  9. A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors

    PubMed Central

    Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

    2009-01-01

    A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array with row-parallel 6-bit algorithmic ADCs, row-parallel gray-scale image processors, a pixel-parallel SIMD Processing Element (PE) array, and an instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixel resolution and 6-bit gray-scale images is fabricated in a 0.18 µm standard CMOS process. The chip area is 1.5 mm × 3.5 mm. Each pixel is 9.5 µm × 9.5 µm and each processing element is 23 µm × 29 µm. The experimental results demonstrate that the chip can perform low-level and mid-level image processing and can be applied in real-time vision applications, such as high-speed target tracking. PMID:22454565

  10. Applications of array processors in the analysis of remote sensing images

    NASA Technical Reports Server (NTRS)

    Ramapriyan, H. K.; Strong, J. P.

    1984-01-01

    The architectures, programming characteristics, and ranges of application of past, present, and planned array processors for the digital processing of remote-sensing images are compared. Such functions as radiometric and geometric corrections, principal-components analysis, cluster coding, histogram generation, grey-level mapping, convolution, classification, and mensuration and modeling operations are considered, and both pipeline-type and single-instruction/multiple-data-stream (SIMD) arrays are evaluated. Numerical results are presented in a table, and it is found that the pipeline-type arrays normally used with minicomputers increase their speed significantly at low cost, while even further gains are provided by the more expensive SIMD arrays. Most image-processing operations become I/O-limited when SIMD arrays are used with current I/O devices.

  11. Analysis of array errors and a short-time processor in airborne phased array radars

    Microsoft Academic Search

    Qing-Guang Liu; Ying-Ning Peng

    1996-01-01

    Array errors are inherent in a realistic phased array radar system. The influence of array errors on the clutter degrees of freedom and the clutter subspace in an airborne phased array radar is analyzed. Based on the presented theoretic results, a method of short-time processing followed by coherent integration is proposed for clutter suppression in airborne phased array radars. It

  12. A CCD/CMOS focal-plane array edge detection processor implementing the multiscale veto algorithm

    Microsoft Academic Search

    Lisa Dron McIlrath

    1996-01-01

    A prototype 32×32 array processor fabricated in 2-µm charge-coupled device (CCD)/CMOS technology implementing the multiscale veto edge detection algorithm is presented. In this algorithm, differences between pixel values are computed in the original image, as well as after applying a series of smoothing filters of varying spatial scales. An edge exists between two pixels only if the magnitude of

  13. A PROCESSING ELEMENT ARCHITECTURE FOR HIGH-DENSITY FOCAL PLANE ANALOG PROGRAMMABLE ARRAY PROCESSORS

    Microsoft Academic Search

    G. Liñán-Cembrano; S. Espejo; R. Domínguez-Castro; A. Rodríguez-Vázquez

    The architecture of the elementary Processing Element - PE - used in a recently designed 128x128 Focal Plane Analog Programmable Array Processor is presented. The PE architecture contains the required building blocks to implement bifurcated data flow vision algorithms based on the execution of convolution masks. The vision chip has been implemented in a standard 0.35 µm CMOS technology. The main

  14. Design of Optimal Array Processors for Two-Step Division-Free Gaussian Elimination

    Microsoft Academic Search

    Shietung PENG; Stanislav G. SEDUKHIN

    SUMMARY The design of array processors for solving linear systems using two-step division-free Gaussian elimination method is considered. The two-step method can be used to improve the systems based on the one-step method in terms of numerical stability as well as the requirements for high-precision. In spite of the rather complicated computations needed at each iteration

  15. Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies

    NASA Astrophysics Data System (ADS)

    Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio

    2011-11-01

    Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of the RX algorithm on multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability on different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristics (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.

  16. Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

    SciTech Connect

    Aaby, Brandon G [ORNL; Perumalla, Kalyan S [ORNL; Seal, Sudip K [ORNL

    2010-01-01

    An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, which delivers over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

  17. Determination of the Rotational and Translational Components of a Flow Field Using a Content Addressable Parallel Processor

    Microsoft Academic Search

    Martha E. Steenstrup; Daryl T. Lawton; Charles C. Weems

    1983-01-01

    The realization of motion perception in artificial systems will require highly parallel architectures. The authors demonstrate the use of a content addressable parallel processor as an effective means of quickly and accurately decomposing a flow field into its rotational and translational components to recover the parameters of sensor motion. 2 references.

  18. Parallel arrays of Josephson junctions for submillimeter local oscillators

    NASA Technical Reports Server (NTRS)

    Pance, Aleksandar; Wengler, Michael J.

    1992-01-01

    In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

  19. Stochastic simulation of charged particle transport on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Earl, James A.

    1988-01-01

    Computations of cosmic-ray transport based upon finite-difference methods are afflicted by instabilities, inaccuracies, and artifacts. To avoid these problems, researchers developed a Monte Carlo formulation which is closely related not only to the finite-difference formulation, but also to the underlying physics of transport phenomena. Implementations of this approach are currently running on the Massively Parallel Processor at Goddard Space Flight Center, whose enormous computing power overcomes the poor statistical accuracy that usually limits the use of stochastic methods. These simulations have progressed to a stage where they provide a useful and realistic picture of solar energetic particle propagation in interplanetary space.

  20. Block iterative restoration of astronomical images with the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Heap, Sara R.; Lindler, Don J.

    1987-01-01

    A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.

  1. Solar-pumped Nd:Cr:GSGG parallel array laser

    Microsoft Academic Search

    George A. Thompson; V. Krupkin; Amnon Yogev; Moshe Oron

    1992-01-01

    A compact, parallel array of three Nd:Cr:GSGG laser rods is used to construct a quasi-CW laser. The array is pumped by concentrated solar light and is mounted in a single concentrator. The three laser rods use a common pair of laser mirrors to define the optical resonator. The three laser beams are not coherently coupled in these experiments. The simplicity

  2. Animated computer graphics models of space and earth sciences data generated via the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

    1987-01-01

    The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

  3. Processor/memory/array size tradeoffs in the design of SIMD arrays for a spatially mapped workload

    Microsoft Academic Search

    Martin C. Herbordt; Anisha Anand; Owais Kidwai; Renoy Sam; Charles C. Weems

    1997-01-01

    Though massively parallel SIMD arrays continue to be promising for many computer vision applications, they have undergone few systematic empirical studies. The problems include the size of the architecture space, the lack of portability of the test programs, and the inherent complexity of simulating up to hundreds of thousands of processing elements. The latter two issues have been addressed previously,

  4. A finite field processor employing dual parallel data path for high-speed/low-power RS-ECC applications

    Microsoft Academic Search

    Hyung-Joon Kwon; Young-Beom Jang; Bangwon Lee

    2000-01-01

    We suggest a finite field processor for a Reed-Solomon decoder which can be used for next generation DVDP/ROM, HD-DVD applications. The suggested processor implements Massey-Berlekamp's FSR synthesis algorithm which solves Newton's identity to find the coefficients of the error locator polynomial. By employing a SIMD-like dual parallel pipelined data path which exploited the characteristics of the required computation of the

  5. Array Optimizations for Parallel Implementations of High Productivity Languages

    E-print Network

    Budimlić, Zoran

    Array Optimizations for Parallel Implementations of High Productivity Languages Mackale Joyner, vsarkar, ruizhang}@cs.rice.edu Abstract DARPA's HPCS program has set a goal of bringing high productivity to high-performance computing. This has resulted in the creation of three new high-level languages

  6. High-frequency optoacoustic arrays using parallel etalon detection

    NASA Astrophysics Data System (ADS)

    Huang, Sheng-Wen; Hou, Yang; Ashkenazi, Shai; O'Donnell, Matthew

    2008-02-01

    Here we present an ultrasound detection system with an optical end capable of parallel probing. An erbium-doped fiber amplifier, driven by a tunable laser, outputs light at 27 dBm. A lens collimates the light to probe a 6-μm thick SU-8 etalon and controls the parallel detection area (total array size). A two-lens system guides the reflected light into a photodetector and controls the active area (array element size) on the etalon surface. A translation stage carries the photodetector to detect signals from different array elements. The output of the photodetector is recorded using an oscilloscope. The system's noise equivalent pressure was estimated to be 6.5 kPa over 10 to 50 MHz using a calibrated piezoelectric transducer when the -3 dB parallel detection area was 1.8 mm in diameter. The detection bandwidth was estimated to exceed 70 MHz using a focused 50 MHz piezoelectric transducer. Using a single probe wavelength, a 1D array with 41 elements and a 1.06 mm aperture length was formed to image a 49 μm black bead photoacoustically. The final image shows an object size of about 95 μm in diameter. According to the results, realizing high-frequency 2D optoacoustic arrays using an etalon is possible.

  7. Basic data-base operations on the Butterfly Parallel Processor: experiment results. Memorandum report, January-December 1987

    SciTech Connect

    Rosenau, T.J.; Jajodia, S.

    1988-03-04

    The next phase in speeding up data-base queries will be through the use of highly parallel computers. This paper will discuss the basic data-base operations (select, project, natural join, and scalar aggregates) on a shared-memory multiple instruction stream, multiple data stream (MIMD) computer and the problems associated with implementing them. Some problems associated with getting maximum parallelization are improper data division and hot spots. Improper data division results when the number of tasks does not divide evenly among the processors. Hot spots or contentions occur due to locking if accesses are made to the same segment of a RAMFile and also if attempts are made to get data from the same remote processor at the same time. These algorithms have been implemented on the Butterfly Parallel Processor, and the results of our experiments are described in detail.
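
    The cost of "improper data division" described above can be illustrated with a small calculation. This is a minimal sketch that assumes every task costs the same and ignores locking and communication; the function names are hypothetical and are not taken from the report.

```python
def partition_sizes(num_tasks, num_procs):
    """Split num_tasks as evenly as possible; sizes differ by at most one."""
    base, extra = divmod(num_tasks, num_procs)
    return [base + 1 if p < extra else base for p in range(num_procs)]


def parallel_efficiency(num_tasks, num_procs):
    """Ideal per-processor load divided by the load of the busiest processor."""
    busiest = max(partition_sizes(num_tasks, num_procs))
    return (num_tasks / num_procs) / busiest


if __name__ == "__main__":
    for tasks in (16, 17, 32, 33):
        eff = parallel_efficiency(tasks, 16)
        print(f"{tasks:3d} tasks on 16 processors: efficiency {eff:.2f}")
```

    With equal-cost tasks, a single leftover task forces one processor to do a full extra unit of work while the others wait, so efficiency drops sharply when the task count is just above a multiple of the processor count.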

  8. A SIMD Cellular Processor Array Vision Chip With Asynchronous Processing Capabilities

    Microsoft Academic Search

    Alexey Lopich; Piotr Dudek

    2011-01-01

    This paper describes an architecture and implementation of a digital vision chip that features mixed asynchronous/synchronous processing techniques. The vision chip is based on a massively parallel cellular array of processing elements, which incorporate a photo-sensor with an ADC and digital processing circuit, consisting of 64 bits of local memory, ALU, flag register and communication units. The architecture

  9. Parallel computation of optimized arrays for 2-D electrical imaging surveys

    NASA Astrophysics Data System (ADS)

    Loke, M. H.; Wilkinson, P. B.; Chambers, J. E.

    2010-12-01

    Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman-Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix-vector multiplications for a single add-on array were replaced with matrix-matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. 
Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.
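
    The Sherman-Morrison rank-1 update mentioned in this abstract is a general identity for updating a matrix inverse: (A + u v^T)^-1 = A^-1 - (A^-1 u)(v^T A^-1) / (1 + v^T A^-1 u). The sketch below shows only that generic identity in plain Python on a tiny matrix. It is not the authors' resolution-matrix code, and the function names are illustrative.

```python
def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T), given a_inv = A^-1."""
    n = len(a_inv)
    a_inv_u = mat_vec(a_inv, u)  # column vector A^-1 u
    # Row vector v^T A^-1.
    v_a_inv = [sum(v[i] * a_inv[i][j] for i in range(n)) for j in range(n)]
    denom = 1.0 + sum(vi * x for vi, x in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update makes the matrix singular")
    return [
        [a_inv[i][j] - a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # A = diag(2, 4), so A^-1 = diag(0.5, 0.25); update with u = (1, 1), v = (1, 0).
    updated = sherman_morrison([[0.5, 0.0], [0.0, 0.25]], [1.0, 1.0], [1.0, 0.0])
    for row in updated:
        print(row)
```

    The update costs on the order of n^2 operations instead of the n^3 needed to invert the modified matrix from scratch, which is why it suits a loop that tests many candidate additions one at a time.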

  10. Parallel nanoimaging and nanolithography using a heated microcantilever array

    NASA Astrophysics Data System (ADS)

    Somnath, Suhas; Kim, Hoe Joon; Hu, Huan; King, William P.

    2014-01-01

    We report parallel topographic imaging and nanolithography using heated microcantilever arrays integrated into a commercial atomic force microscope (AFM). The array has five AFM cantilevers, each of which has an internal resistive heater. The temperatures of the cantilever heaters can be monitored and controlled independently and in parallel. We perform parallel AFM imaging of a region of size 550 μm × 90 μm, where the cantilever heat flow signals provide a measure of the nanometer-scale substrate topography. At a cantilever scan speed of 1134 μm s⁻¹, we acquire a 3.1 million-pixel image in 62 s with noise-limited vertical resolution of 0.6 nm and pixels of size 351 nm × 45 nm. At a scan speed of 4030 μm s⁻¹ we acquire a 26.4 million pixel image in 124 s with vertical resolution of 5.4 nm and pixels of size 44 nm × 43 nm. Finally, we demonstrate parallel nanolithography with the cantilever array, including iterations of measure-write-measure nanofabrication, with each cantilever operating independently.
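
    The pixel counts quoted in this abstract follow from dividing the imaged area by the pixel area. The sketch below repeats that arithmetic using only the figures given above; the function name and the derived throughput numbers are illustrative and do not come from the paper.

```python
def pixel_count(width_um, height_um, pixel_w_nm, pixel_h_nm):
    """Imaged area divided by pixel area, with all lengths converted to nm."""
    area_nm2 = (width_um * 1000.0) * (height_um * 1000.0)
    return area_nm2 / (pixel_w_nm * pixel_h_nm)


# Region of 550 um x 90 um imaged at the two pixel sizes quoted in the abstract.
slow = pixel_count(550, 90, 351, 45)
fast = pixel_count(550, 90, 44, 43)

print(f"351 nm x 45 nm pixels: {slow / 1e6:.2f} million pixels")
print(f" 44 nm x 43 nm pixels: {fast / 1e6:.2f} million pixels")

# Throughput implied by the quoted pixel counts and acquisition times.
print(f"throughput at 62 s : {3.1e6 / 62:.0f} pixels/s")
print(f"throughput at 124 s: {26.4e6 / 124:.0f} pixels/s")
```

    The first result agrees with the quoted 3.1 million pixels, and the second lands within about one percent of the quoted 26.4 million, which is consistent with rounding of the stated pixel dimensions.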

  11. Parallel nanoimaging and nanolithography using a heated microcantilever array.

    PubMed

    Somnath, Suhas; Kim, Hoe Joon; Hu, Huan; King, William P

    2014-01-10

    We report parallel topographic imaging and nanolithography using heated microcantilever arrays integrated into a commercial atomic force microscope (AFM). The array has five AFM cantilevers, each of which has an internal resistive heater. The temperatures of the cantilever heaters can be monitored and controlled independently and in parallel. We perform parallel AFM imaging of a region of size 550 μm × 90 μm, where the cantilever heat flow signals provide a measure of the nanometer-scale substrate topography. At a cantilever scan speed of 1134 μm s⁻¹, we acquire a 3.1 million-pixel image in 62 s with noise-limited vertical resolution of 0.6 nm and pixels of size 351 nm × 45 nm. At a scan speed of 4030 μm s⁻¹ we acquire a 26.4 million pixel image in 124 s with vertical resolution of 5.4 nm and pixels of size 44 nm × 43 nm. Finally, we demonstrate parallel nanolithography with the cantilever array, including iterations of measure-write-measure nanofabrication, with each cantilever operating independently. PMID:24334342

  12. NOSC (Naval Ocean Systems Center) advanced systolic array processor (ASAP). Professional paper for period ending August 1987

    SciTech Connect

    Loughlin, J.P.

    1987-12-01

    Design of a high-speed (250 million 32-bit floating-point operations per second) two-dimensional systolic array composed of 16-bit/slice microsequencer structured processors is presented. System-design features such as broadcast data flow, tag bit movement, and integrated diagnostic test registers are described. The software development tools needed to map complex matrix-based signal-processing algorithms onto the systolic-processor system are described.

  13. Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications

    NASA Technical Reports Server (NTRS)

    Bruno, John

    1984-01-01

    The results of an investigation into the feasibility of using the MPP for direct and large eddy simulations of the Navier-Stokes equations are presented. A major part of this study was devoted to the implementation of two of the standard numerical algorithms for CFD. These implementations were not run on the Massively Parallel Processor (MPP) since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed, and from it were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered; namely, adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This, combined with the prospect of building a faster and larger MPP-like machine, holds the promise of achieving the sustained gigaflop rates that are required for the numerical simulations in CFD.

  14. High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

    DOEpatents

    Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

    2002-01-01

    As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance scales have been shown to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

  15. Microchannel cross load array with dense parallel input

    DOEpatents

    Swierkowski, Stefan P.

    2004-04-06

    An architecture or layout for microchannel arrays using T or Cross (+) loading for electrophoresis or other injection and separation chemistry that are performed in microfluidic configurations. This architecture enables a very dense layout of arrays of functionally identical shaped channels and it also solves the problem of simultaneously enabling efficient parallel shapes and biasing of the input wells, waste wells, and bias wells at the input end of the separation columns. One T load architecture uses circular holes with common rows, but not columns, which allows the flow paths for each channel to be identical in shape, using multiple mirror image pieces. Another T load architecture enables the access hole array to be formed on a biaxial, collinear grid suitable for EDM micromachining (square holes), with common rows and columns.

  16. Parallel vacuum arc discharge with microhollow array dielectric and anode

    SciTech Connect

    Feng, Jinghua; Zhou, Lin; Fu, Yuecheng; Zhang, Jianhua; Xu, Rongkun; Chen, Faxin; Li, Linbo; Meng, Shijian, E-mail: mengshijian04@126.com [Institute of Nuclear Physics and Chemistry, China Academy of Engineering Physics, Mianyang 621900 (China)

    2014-07-15

    An electrode configuration with microhollow array dielectric and anode was developed to obtain parallel vacuum arc discharge. Compared with the conventional electrodes, more than 10 parallel microhollow discharges were ignited for the new configuration, which increased the discharge area significantly and made the cathode erode more uniformly. The vacuum discharge channel number could be increased effectively by decreasing the distances between holes or increasing the arc current. Experimental results revealed that plasmas ejected from the adjacent hollow and the relatively high arc voltage were two key factors leading to the parallel discharge. The characteristics of plasmas in the microhollow were investigated as well. The spectral line intensity and electron density of plasmas in the microhollow increased markedly with the decrease of the microhollow diameter.

  17. Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor

    NASA Astrophysics Data System (ADS)

    Kumaki, Takeshi; Ishizaki, Masakatsu; Koide, Tetsushi; Mattausch, Hans Jürgen; Kuroda, Yasuto; Gyohten, Takayuki; Noda, Hideyuki; Dosaka, Katsumi; Arimoto, Kazutami; Saito, Kazunori

    This paper presents an integration architecture of content addressable memory (CAM) and a massive-parallel memory-embedded SIMD matrix for constructing a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be a better way for processing the repeated arithmetic operation types in multimedia applications. The proposed architecture, reported in this paper, exploits in addition CAM technology and enables therefore fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed novel architecture can realize consequently efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the example of the frequently used JPEG image-compression application show that the necessary clock cycle number can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performances in Mpixel/mm² are factors 3.3 and 4.4 better than with a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP, respectively.

  18. Lumped-Element Planar Strip Array (LPSA) for Parallel MRI

    PubMed Central

    Lee, Ray F.; Hardy, Christopher J.; Sodickson, Daniel K.; Bottomley, Paul A.

    2007-01-01

    The recently introduced planar strip array (PSA) can significantly reduce scan times in parallel MRI by enabling the utilization of a large number of RF strip detectors that are inherently decoupled, and are tuned by adjusting the strip length to integer multiples of a quarter-wavelength (λ/4) in the presence of a ground plane and dielectric substrate. In addition, the more explicit spatial information embedded in the phase of the signals from the strip array is advantageous (compared to loop arrays) for limiting aliasing artifacts in parallel MRI. However, losses in the detector as its natural resonance frequency approaches the Larmor frequency (where the wavelength is long at 1.5 T) may limit the signal-to-noise ratio (SNR) of the PSA. Moreover, the PSA’s inherent λ/4 structure severely limits our ability to adjust detector geometry to optimize the performance for a specific organ system, as is done with loop coils. In this study we replaced the dielectric substrate with discrete capacitors, which resulted in both SNR improvement and a tunable lumped-element PSA (LPSA) whose dimensions can be optimized within broad constraints, for a given region of interest (ROI) and MRI frequency. A detailed theoretical analysis of the LPSA is presented, including its equivalent circuit, electromagnetic fields, SNR, and g-factor maps for parallel MRI. Two different decoupling schemes for the LPSA are described. A four-element LPSA prototype was built to test the theory with quantitative measurements on images obtained with parallel and conventional acquisition schemes. PMID:14705058

  19. Calculating the electrostatic potential of molecular models with separate evaluations by conventional, vector, and array processors.

    PubMed

    Egan, J T; MacElroy, R D

    1984-01-01

    A simple computational scheme for estimating the electrostatic potential about molecular models of moderate size is given. The large number of calculations required for the evaluation of the hypersurface lends itself to treatment by high speed, unconventional computing machines. The essence of these calculations lies in Coulombic interactions that are computed between hypothetical proton test probes positioned in a gridded region surrounding the model and the partial electrostatic charges (CNDO/2) of each atom in the model. A specific scientific application is discussed which involves the recognition of amino acids and nucleotide bases. Three different evaluations of the potential hypersurface within the context of this approach were made. The first was performed on a VAX 11/780 which is a general purpose machine widely used in the scientific community; the second was performed using a pipelined Vector Processor, the FPS AP-120B; and the third by a processor array, the ILLIAC-IV. A comparison of the architectures and processing speeds of each class of machines is made. The computing power observed is consistent with the design and purpose of each machine. Also discussed are methods for displaying the vast amount of data that result from such calculations. It is determined that computer graphics offers an effective means for extracting information from large amounts of data. Finally, the scientific value of the calculations is briefly discussed. If caution is applied to interpreting the results, then the electrostatic potential (EP) mappings can be useful in identifying sites of potential chemical interactions. PMID:11540822

  20. Effects of rotation on turbulent convection: Direct numerical simulation using parallel processors

    NASA Astrophysics Data System (ADS)

    Chan, Daniel Chiu-Leung

    A new parallel implicit adaptive mesh refinement (AMR) algorithm is developed for the prediction of unsteady behaviour of laminar flames. The scheme is applied to the solution of the system of partial-differential equations governing time-dependent, two- and three-dimensional, compressible laminar flows for reactive thermally perfect gaseous mixtures. A high-resolution finite-volume spatial discretization procedure is used to solve the conservation form of these equations on body-fitted multi-block hexahedral meshes. A local preconditioning technique is used to remove numerical stiffness and maintain solution accuracy for low-Mach-number, nearly incompressible flows. A flexible block-based octree data structure has been developed and is used to facilitate automatic solution-directed mesh adaptation according to physics-based refinement criteria. The data structure also enables an efficient and scalable parallel implementation via domain decomposition. The parallel implicit formulation makes use of a dual-time-stepping like approach with an implicit second-order backward discretization of the physical time, in which a Jacobian-free inexact Newton method with a preconditioned generalized minimal residual (GMRES) algorithm is used to solve the system of nonlinear algebraic equations arising from the temporal and spatial discretization procedures. An additive Schwarz global preconditioner is used in conjunction with block incomplete LU type local preconditioners for each sub-domain. The Schwarz preconditioning and block-based data structure readily allow efficient and scalable parallel implementations of the implicit AMR approach on distributed-memory multi-processor architectures. The scheme was applied to solutions of steady and unsteady laminar diffusion and premixed methane-air combustion and was found to accurately predict key flame characteristics. 
For a premixed flame under terrestrial gravity, the scheme accurately predicted the frequency of the natural buoyancy induced oscillations. The performance of the proposed parallel implicit algorithm was assessed by comparisons to more conventional solution procedures and was found to significantly reduce the computational time required to achieve a solution in all cases investigated.

  1. A Scalable, Multi-Thread, Multi-Issue Array Processor Architecture for DSP Applications Based on Extended

    E-print Network

    Kuzmanov, Georgi

    A Scalable, Multi-Thread, Multi-Issue Array Processor Architecture for DSP Applications Based and memories are distributed across the chip and communicate with each other by special networks, forming], that was extended to eliminate all central control structures for the data flow and to support multithreading

  2. Image Understanding Architecture: Exploiting Potential Parallelism in Machine Vision

    Microsoft Academic Search

    Charles C. Weems; Edward M. Riseman; Allen R. Hanson

    1992-01-01

    A hardware architecture that addresses at least part of the potential parallelism in each of the three levels of vision abstraction, low (sensory), intermediate (symbolic), and high (knowledge-based), is described. The machine, called the image understanding architecture (IUA), consists of three different, tightly coupled parallel processors; the content addressable array parallel processor (CAAPP) at the low level, the intermediate communication

  3. Design of 4-kbit×4-layer optically coupled three-dimensional common memory for parallel processor system

    Microsoft Academic Search

    MITSUMASA KOYANAGI; HIROKAZU TAKATA; HIROKI MORI; JUNICHIRO IBA

    1990-01-01

    The optically coupled three-dimensional common (3D-OCC) memory is an intelligent memory for a real-time parallel processor system. It consists of a multilayered structure of two-dimensional memory with LEDs and photoconductors. The memory layers are optically coupled with each other through the LEDs and the photoconductors. Data are transferred in the vertical direction by optical coupling, while the conventional memory operations

  4. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    NASA Technical Reports Server (NTRS)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  5. Programmable retinal dynamics in a CMOS mixed-signal array processor chip

    NASA Astrophysics Data System (ADS)

    Carmona, Ricardo A.; Jimenez-Garrido, Francisco J.; Dominguez-Castro, Rafael; Espejo, Servando; Rodriguez-Vazquez, Angel

    2003-04-01

    The retina is responsible for the treatment of visual information at early stages. Visual stimuli generate patterns of activity that are transmitted through its layered structure up to the ganglion cells that interface it to the optic nerve. In this trip of micrometers, information is sustained by continuous signals that interact in excitatory and inhibitory ways. This low-level processing compresses the relevant information of the images to a manageable size. The behavior of the more external layers of the biological retina has been successfully modelled within the Cellular Neural Network framework. Interactions between cells are realized on a local basis. Each cell interacts with its nearest neighbors and every cell in the same layer follows the same interconnection pattern. Intra- and inter-layer interactions are continuous in magnitude and time. The evolution of the network can be described by a set of coupled nonlinear differential equations. A mixed-signal VLSI implementation of focal-plane low-level image processing based upon this biological model constitutes a feasible and cost effective alternative to conventional digital processing in real-time applications. A CMOS Programmable Array Processor prototype chip has been designed and fabricated in a standard technology. It has been successfully tested, validating the proposed design techniques. The integrated system consists of a network of 2 coupled layers, containing 32×32 elementary processors, running at different time constants. Involved image processing algorithms can be programmed on this chip by tuning the appropriate interconnection weights, internally coded as analog but programmed via a digital interface. Propagative, active wave phenomena and retina-like effects can be observed in this chip. Low-level image processing tasks for early vision applications can be developed based on these high-order dynamics.

  6. Performance evaluation of the HEP, ELXSI and CRAY X-MP parallel processors on hydrocode test problems

    SciTech Connect

    Liebrock, L.M.; McGrath, J.F.; Hicks, D.L.

    1986-07-07

    Parallel programming promises improved processing speeds for hydrocodes, magnetohydrocodes, multiphase flow codes, thermal-hydraulics codes, wavecodes and other continuum dynamics codes. This paper presents the results of some investigations of parallel algorithms on three parallel processors: the CRAY X-MP, ELXSI and the HEP computers. Introduction and Background: We report the results of investigations of parallel algorithms for computational continuum dynamics. These programs (hydrocodes, wavecodes, etc.) produce simulations of the solutions to problems arising in the motion of continua: solid dynamics, liquid dynamics, gas dynamics, plasma dynamics, multiphase flow dynamics, thermal-hydraulic dynamics and multimaterial flow dynamics. This report restricts its scope to one-dimensional algorithms such as the von Neumann-Richtmyer (1950) scheme.

  7. Processing in Memory: The Terasys Massively Parallel PIM Array

    Microsoft Academic Search

    Maya Gokhale; William Holmes; Ken Iobst

    1995-01-01

    SRC researchers have designed and fabricated a processor-in-memory (PIM) chip, a standard 4-bit memory augmented with a single-bit ALU controlling each column of memory. In principle, PIM chips can replace the memory of any processor, including a supercomputer. To validate the notion of integrating SIMD computing into conventional processors on a more modest scale, we have built a half dozen

  8. Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986

    SciTech Connect

    Golub, G.H.

    1987-04-30

    The basic objective of this project was to consider a large class of matrix computations with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications such as the solution of partial differential equations by multigrid methods and in the fitting of geodetic data. Some of the methods developed have already found their use on some of the newly developed architectures.

  9. Three-Dimensional Sequential/Parallel Universal Array Grammars and Object Pattern Analysis

    Microsoft Academic Search

    Patrick Shen-pei Wang

    1992-01-01

    We introduce a sequential/parallel parsing algorithm for analyzing 3-dimensional objects represented by 3-d array grammars. The mechanism serves as a compromise between purely sequential methods which take too much time, and purely parallel methods which take too much hardware for large digital arrays.

  10. Efficient Support of Parallel Sparse Computation for Array Intrinsic Functions of Fortran 90 *

    E-print Network

    Lee, Jenq-Kuen

    if these intrinsic functions are applied to sparse data sets. In this paper, we address this open gap by presenting­dimensional array objects concurrently. They provide a rich source of parallelism and play an increasingly important an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method

  11. Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor

    NASA Technical Reports Server (NTRS)

    Moore, J. Strother

    1992-01-01

    Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease, et al work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.

  12. FPGA computing in a data parallel C

    Microsoft Academic Search

    Maya Gokhale; Ron Minnich

    1993-01-01

    The authors demonstrate a new technique for automatically synthesizing digital logic from a high level algorithmic description in a data parallel language. The methodology has been implemented using the Splash 2 reconfigurable logic arrays for programs written in Data-parallel Bit-serial C (dbC). The translator generates a VHDL description of a SIMD processor array with one or more processors per Xilinx

  13. Automatic Allocation of Arrays to Memories in FPGA Processors with Multiple Memory Banks

    Microsoft Academic Search

    Maya Gokhale; Janice M. Stone

    1999-01-01

    FPGA-based processors, like many conventional DSP systems, often associate small high performance memories with each processing chip. These memories may be on-board embedded SRAMs or discrete parts. In the process of mapping a computation onto an FPGA processor, it is necessary to map the applications' data to memories. In this work, we present an algorithm that has been implemented in

  14. Clocking and circuit design for a parallel I/O on a first-generation CELL processor

    Microsoft Academic Search

    Ken Chang; Sudhakar Pamarti; Kambiz Kaviani; Elad Alon; Xudong Shi; Jie Shen; Gary Yip; Chris Madden; Ralf Schmitt; Chuck Yuan; Fari Assaderaghi; M. Horowitz

    2005-01-01

    A parallel I/O is integrated on a first-generation CELL processor in 90nm SOI CMOS. A clock-tracking architecture suppresses reference jitter to achieve 6.4Gbit/s/link operation at 21.6mW/Gbit/s. SOI effects on analog circuits, in particular high-speed receivers, are addressed to achieve a receiver sensitivity of ±12mV at 6.4Gbit/s with BER <10^-14 measured using 7b PRBS data.

  15. Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank

    NASA Astrophysics Data System (ADS)

    Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

    2014-05-01

    Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. Such increase in bandwidth is achieved without use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.

  16. Simulation of a word recognition system on two parallel architectures

    SciTech Connect

    Yoder, M.A.; Jamieson, L.H. (Purdue Univ., Lafayette, IN (USA). Dept. of Electrical Engineering)

    1989-09-01

    When designing a parallel architecture it is advantageous to consider the applications for which the architecture will be used. This paper examines the use of two parallel architectures, a single instruction stream multiple data stream (SIMD) machine and a VLSI processor array, to implement an isolated word recognition system. SIMD and VLSI processor array algorithms were written for each of the components of the recognition system. The component parallel algorithms were simulated along with two complete recognition systems, one composed of SIMD algorithms and the other composed of VLSI processor array algorithms.

  17. Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2012-01-01

    The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

  18. Combined Application of Low Power Code Transformations and Sub word Parallelism Exploitation for VLIW Multimedia Processors

    Microsoft Academic Search

    K. Masselos; F. Catthoor; C. E. Goutis; H. DeMan

    In this paper we address the important issues in mapping multimedia applications on Very Long Instruction Word (VLIW) multimedia processors. The main design quality factors of applications realized on the target architecture platform are presented and their interactions are explored. Power consumption due to data storage and transfers forms a significant part of the total power budget of applications realized

  19. Power-Aware Scheduling for Parallel Security Processors with Analytical Models

    Microsoft Academic Search

    Yung-chia Lin; Yi-ping You; Chung-wen Huang; Jenq-Kuen Lee; Wei-Kuan Shih; Ting-Ting Hwang

    2004-01-01

    Techniques to reduce power dissipation for embedded systems have recently come into sharp focus in the technology development. Among these techniques, dynamic voltage scaling (DVS), power gating (PG), and multiple-domain partitioning are regarded as effective schemes to reduce dynamic and static power. In this paper, we investigate the problem of power-aware scheduling tasks running on a scalable encryption processor,

  20. Multimode power processor

    DOEpatents

    O'Sullivan, George A. (Pottersville, NJ); O'Sullivan, Joseph A. (St. Louis, MO)

    1999-01-01

    In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

  1. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

  2. Accelerating Haskell array codes with multicore GPUs

    Microsoft Academic Search

    Manuel M. T. Chakravarty; Gabriele Keller; Sean Lee; Trevor L. McDonell; Vinod Grover

    2011-01-01

    Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations. We embed

  3. QLISP for parallel processors. Final report, 15 July 1986-31 July 1988

    SciTech Connect

    McCarthy, J.

    1989-01-01

    The goal of the QLISP project at Stanford is to gain experience with the shared-memory, queue-based approach to parallel Lisp, by implementing the QLISP language on an actual multiprocessor, and by developing a symbolic algebra system as a testbed application. The experiments performed on the simulator included: 1. Algorithms for sorting and basic data-structure manipulation for polynomials. 2. Partitioning and scheduling methods for parallel programming. 3. Parallelizing the production rule system OPS5.

  4. NOISE IMPROVEMENT AND STOCHASTIC RESONANCE IN PARALLEL ARRAYS OF SENSORS WITH SATURATION

    E-print Network

    Chapeau-Blondeau, François

    NOISE IMPROVEMENT AND STOCHASTIC RESONANCE IN PARALLEL ARRAYS OF SENSORS WITH SATURATION David ROUSSEAU, François CHAPEAU-BLONDEAU Laboratoire d'Ingénierie des Systèmes Automatisés (LISA), Universit

  5. Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

    SciTech Connect

    Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

    2005-09-20

    At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over μClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

  6. Multipoint parallel excitation and CCD-based imaging system for high-throughput fluorescence detection of biochip micro-arrays

    Microsoft Academic Search

    D. S. Mehta; C. Y. Lee; A. Chiou

    2001-01-01

    We report the development and the characterization of a multipoint parallel excitation and CCD-based imaging system for high-throughput fluorescence detection of biochip micro-arrays. A two-dimensional array of (19×19) points with uniform intensity distribution, generated by a holographic array generator, was used for parallel excitation of two-dimensional micro-arrays of fluorescence samples. A CCD-based imaging system was used for high-throughput parallel detection

  7. Design and evaluation of fault-tolerant VLSI/WSI processor arrays. Final technical report, 1 July 1985-31 December 1987

    SciTech Connect

    Fortes, J.A.

    1987-12-31

    This document is the final report of work performed under the project entitled Design and Evaluation of Fault-Tolerant VLSI/WSI Processor Arrays supported by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization and administered through the Office of Naval Research under Contract No. 00014-85-k-0588. With the concurrence of Dr. Clifford Lau, the Scientific Officer for this project, this final report consists of reprints of publications reporting work performed under the project. In the attached list of publications are papers where fault-tolerant systems for processor arrays are proposed and studied. Studies on algorithmic and software aspects relevant to the systems are also reported, as well as hardware and reconfigurability issues for fault-tolerant processor arrays.

  8. High-performance computational chemistry: Hartree-Fock electronic structure calculations on massively parallel processors.

    SciTech Connect

    Tilson, J. L.; Minkoff, M.; Wagner, A. F.; Shepard, R.; Sutton, P.; Harrison, R. J.; Kendall, R. A.; Wong, A. T.; PNNL

    1999-01-01

    The parallel performance of the NWChem version 1.2α parallel direct-SCF code has been characterized on five massively parallel supercomputers (IBM SP, Kendall Square KSR-2, CRAY T3D and T3E, and Intel Touchstone DELTA) using single-point energy calculations on seven molecules of varying size (up to 389 atoms) and composition (first-row atoms, halogens, and transition metals). The authors compare the performance using both replicated-data and distributed-data algorithms and the original McMurchie-Davidson and recently incorporated TEXAS integrals packages.

  9. One-dimensional optoacoustic receive array employing parallel detection and video-rate acquisition

    Microsoft Academic Search

    Ya Shu; Xinqing Guo; Mengyang Liu; Takashi Buma

    2010-01-01

    Optical techniques are a promising technology to realize high frequency ultrasound arrays. High sensitivity and broad bandwidth have been demonstrated with thin film etalon sensors. Etalon arrays usually involve synthetic aperture techniques, where a single probe laser beam or photodiode is scanned over the sensing region. True parallel detection suitable for video-rate B-mode imaging remains a challenge. We demonstrate a

  10. Low frequency noise in arrays of magnetic tunnel junctions connected in series and parallel

    Microsoft Academic Search

    R. Guerrero; M. Pannetier-Lecoeur; C. Fermon; S. Cardoso; R. Ferreira; P. P. Freitas

    2009-01-01

    Low frequency noise and small output voltage are the strongest limitations to the use of magnetic tunnel junctions (MTJs) for magnetic sensor applications, replacing giant magnetoresistance (GMR) and anisotropic magnetoresistance sensors. In this paper, we explore the possibility of using arrays with a large number of MTJs connected in parallel/series to overcome these limitations. MTJ's sensor arrays of more than

  11. The Panda Array I/O library on the Galley Parallel File System

    E-print Network

    1 The Panda Array I/O library on the Galley Parallel File System Joel T. Thomas Dartmouth Computer Joel.T.Thomas@dartmouth.edu June 5, 1996 Abstract The Panda Array I/O library, created some time, and the Panda project is an attempt to ameliorate this problem while still providing

  12. Automatic Synthesis of Parallel Programs Targeted to Dynamically Reconfigurable Logic Arrays

    Microsoft Academic Search

    Maya Gokhale; Aaron Marks

    1995-01-01

    Dynamically reconfigurable Field Programmable Gate Arrays (FPGAs) offer virtually unlimited numbers of gates to an application. This technology makes feasible large applications which can be temporally partitioned, with each phase being rapidly loaded onto the chip as required. We demonstrate in this paper an automatic technique to temporally partition a parallel program. Our technique partitions along a data parallel C

  13. Parallel algorithms for the maxima problem using an N-cube processor configuration

    E-print Network

    Coffman, Sarah Wilson

    1989-01-01

    tree of height of O(dNlogN), he pointed out that the decision tree model may be inadequate for the problem when d & 3, since the results obtained from its use do not correspond to the practical algorithms developed. Bentley et al [1978] researched... al. [1975] to prove that their algorithm for the three-dimensional maxima problem is optimal. This also led to their conclusion that the d-dimensional problem can be solved in O(log N) time using a CREW PRAM with O(N) processors for d & 3...

  14. Parallel algorithms for the maxima problem using an N-cube processor configuration 

    E-print Network

    Coffman, Sarah Wilson

    1989-01-01

    problem for a set of N d-dimensional points using an N-cube processor configuration. Algorithms for the two- and three-dimensional problems are developed first. These algorithms are then extended to solve the maxima problem in higher dimensions... the maxima problem. The maxima problem is defined in the following statements: Let S be a set of N d-dimensional points and let x(i, s) represent the s-th coordinate of i, where 1 ≤ s ≤ d. Let i and j be points contained in S. The point i is said...
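
    The definition in the snippet above is cut off, so the following is only a sequential sketch under an assumption: it uses the usual dominance relation (a point dominates another if it is at least as large in every coordinate and the two differ), which may not match the thesis wording exactly. It is a brute-force O(d N^2) reference, not the N-cube parallel algorithm the thesis develops. The names `dominates` and `maxima` are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Assumed dominance test: p >= q in every coordinate and p != q."""
    return all(a >= b for a, b in zip(p, q)) and tuple(p) != tuple(q)


def maxima(points: List[Point]) -> List[Point]:
    """Return the points of S that no other point dominates (brute force)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Small 2-dimensional example.
pts = [(1, 4), (2, 2), (3, 1), (1, 1), (2, 3)]
result = maxima(pts)
print(result)  # [(1, 4), (3, 1), (2, 3)]
```

    Here (2, 2) is dominated by (2, 3) and (1, 1) by (1, 4), so three maximal points remain. A parallel version would distribute the pairwise comparisons across processors; that mapping is not shown here.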

  15. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
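
    The search described above can be mimicked in software as a rough sketch. The patent abstract does not say how "closeness" is measured; squared Euclidean distance is assumed here purely for illustration, and the function name `closest_codebook_index` is hypothetical. Each pass of the outer loop plays the role of one column of processor cells, and the inner sum stands in for the combination device.

```python
from typing import List, Sequence


def closest_codebook_index(input_vec: Sequence[float],
                           codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook vector nearest to input_vec.

    Assumption: closeness is squared Euclidean distance (not stated in the source).
    """
    best_index, best_dist = -1, float("inf")
    for n, code_vec in enumerate(codebook):      # one "column" per codebook vector
        # Per-component comparison, then combination of the M cell outputs.
        dist = sum((x - c) ** 2 for x, c in zip(input_vec, code_vec))
        if dist < best_dist:                     # closeness determination
            best_index, best_dist = n, dist
    return best_index


codebook = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 1.0, 0.0)]  # N = 3, M = 3
inputs = [(0.1, 0.9, 0.2), (0.8, 0.9, 1.1)]
indices = [closest_codebook_index(v, codebook) for v in inputs]
print(indices)  # [2, 1]
```

    The hardware evaluates all N columns simultaneously, whereas this loop visits them one after another; only the output, a codebook index per input vector, is the same.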

  16. Architecture studies and system demonstrations for optical parallel processor for AI and NI

    NASA Astrophysics Data System (ADS)

    Lee, Sing H.

    1988-03-01

    In solving deterministic AI problems, the data search for matching the arguments of a PROLOG expression causes a serious bottleneck when implemented sequentially by electronic systems. To overcome this bottleneck we have developed the concepts for an optical expert system based on a matrix-algebraic formulation, which will be suitable for parallel optical implementation. The optical AI system based on the matrix-algebraic formulation will offer distinct advantages for parallel search, adult learning, etc.

  17. 390–480 GHz photon-assisted tunneling steps generated by parallel Josephson tunnel junction arrays

    Microsoft Academic Search

    F. Boussaha; A. Féret; C. Chaumont; L. Pelay; M. Batrung; B. Lecomte; M. Salez; D. Bouville; F. Dauplay; J. Krieg; G. Beaudin; L. Lapierre

    2010-01-01

    We report on the first direct detection of submillimeter waves emitted by small parallel tunnel junction arrays. The arrays, made up of 10 and 20 Nb/Al-AlOx/Nb junctions of 6 μm², are integrated and coupled in RF to a Nb/Al-AlOx/Nb twin junction-based detector by a microstrip/slotline transition. The detector's I-V curve clearly exhibits photon-assisted steps when the array is biased on the

  18. Analog processor design for potentiometric sensor array and its applications in smart living space

    Microsoft Academic Search

    Danny Wen-Yaw Chung; You-Lin Tsai; Tai-Tsun Liu; Chun-Liang Leu; Chung-Huang Yang; Dorota G. Pijanowska; Wladyslaw Torbicz; Piotr B. Grabiec; Bohdan Jaroszewicz

    2007-01-01

    This paper presents an analog processor design for ion sensitive field effect transistor (ISFET)-based flow through system and its application in smart living space. The dynamic flow-cell measurement explores more information compared to stationary measurement and is useful in environmental monitoring and electronic tongue systems. The multi-channel floating source readout circuitry has been developed for flow-through analysis of ion sensitive

  19. Compiler-Assisted Dynamic Scheduling for Effective Parallelization of Loop Nests on Multicore Processors

    E-print Network

    Rountev, Atanas "Nasko"

    . The starting point of the work reported in this paper is Pluto, a recently developed automatic parallelization... [34, 8, 7, 9, 6]. The key to Pluto's approach is the use of the polyhedral model [3, 36, 29

  20. Processor Allocation and Task Scheduling of Matrix Chain Products on Parallel Systems

    Microsoft Academic Search

    Heejo Lee; Jong Kim; Sung Je Hong; Sunggu Lee

    2003-01-01

    The problem of finding an optimal product sequence for sequential multiplication of a chain of matrices (the matrix chain ordering problem, MCOP) is well-known and has been studied for a long time. In this paper, we consider the problem of finding an optimal product schedule for evaluating a chain of matrix products on a parallel computer (the matrix chain scheduling
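
    For reference, the sequential matrix chain ordering problem (MCOP) mentioned above has a textbook dynamic-programming solution, sketched below. This covers only the sequential ordering problem, not the parallel scheduling problem the paper studies; the function names and the example dimensions are illustrative.

```python
from typing import List, Tuple


def matrix_chain_order(dims: List[int]) -> Tuple[int, List[List[int]]]:
    """Minimum scalar multiplications for A1..An, where Ai is dims[i-1] x dims[i]."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    split = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):               # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = float("inf")
            for k in range(i, j):                # try every split point
                c = cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                if c < cost[i][j]:
                    cost[i][j], split[i][j] = c, k
    return cost[0][n - 1], split


def parenthesize(split: List[List[int]], i: int, j: int) -> str:
    """Rebuild the optimal product sequence from the split table."""
    if i == j:
        return f"A{i + 1}"
    k = split[i][j]
    return f"({parenthesize(split, i, k)} x {parenthesize(split, k + 1, j)})"


dims = [10, 30, 5, 60]                           # A1: 10x30, A2: 30x5, A3: 5x60
best_cost, split = matrix_chain_order(dims)
order = parenthesize(split, 0, len(dims) - 2)
print(best_cost, order)  # 4500 ((A1 x A2) x A3)
```

    The alternative order A1 x (A2 x A3) costs 27000 scalar multiplications in this example, six times more than the optimum.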

  1. Accelerating biomedical signal processing algorithms with parallel programming on graphic processor units

    Microsoft Academic Search

    Evdokimos I. Konstantinidis; Christos A. Frantzidis; Lazaros Tzimkas; Costas Pappas; P. D. Bamidis

    2009-01-01

    This paper investigates the benefits derived by adopting the use of Graphics Processing Unit (GPU) parallel programming in the field of biomedical signal processing. The differences in execution time when computing the Correlation Dimension (CD) of multivariate neurophysiological recordings and the Skin Conductance Level (SCL) are reported by comparing several common programming environments. Moreover, as indicated in this study, the

  2. Obtaining identical results with double precision global accuracy on different numbers of processors in parallel particle Monte Carlo simulations

    SciTech Connect

    Cleveland, Mathew A., E-mail: cleveland7@llnl.gov; Brunner, Thomas A.; Gentile, Nicholas A.; Keasler, Jeffrey A.

    2013-10-15

    We describe and compare different approaches for achieving numerical reproducibility in photon Monte Carlo simulations. Reproducibility is desirable for code verification, testing, and debugging. Parallelism creates a unique problem for achieving reproducibility in Monte Carlo simulations because it changes the order in which values are summed. This is a numerical problem because double precision arithmetic is not associative. Parallel Monte Carlo, both domain replicated and decomposed simulations, will run their particles in a different order during different runs of the same simulation because of the non-reproducibility of communication between processors. In addition, runs of the same simulation using different domain decompositions will also result in particles being simulated in a different order. In [1], a way of eliminating non-associative accumulations using integer tallies was described. This approach successfully achieves reproducibility at the cost of lost accuracy by rounding double precision numbers to fewer significant digits. This integer approach, and other extended and reduced precision reproducibility techniques, are described and compared in this work. Increased precision alone is not enough to ensure reproducibility of photon Monte Carlo simulations. Non-arbitrary precision approaches require a varying degree of rounding to achieve reproducibility. For the problems investigated in this work double precision global accuracy was achievable by using 100 bits of precision or greater on all unordered sums which were subsequently rounded to double precision at the end of every time-step.
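
    The two ideas in this abstract, non-associative double precision sums and order-independent integer tallies, can be illustrated with a minimal sketch. The fixed-point scale `SCALE` and the function `integer_tally` are assumptions made for this example and are not taken from the paper or from reference [1].

```python
import random

# Double precision addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

SCALE = 2 ** 40  # hypothetical fixed-point scale; rounding here loses accuracy


def integer_tally(values):
    """Sum values as scaled integers; integer addition is exact and associative."""
    return sum(round(v * SCALE) for v in values)


random.seed(0)
contributions = [random.uniform(0.0, 1.0) for _ in range(1000)]
shuffled = contributions[:]
random.shuffle(shuffled)  # stands in for a different particle ordering

same = integer_tally(contributions) == integer_tally(shuffled)
print(same)  # True
```

    Each contribution is rounded once, independently of the others, so the tally does not depend on summation order. The price is the rounding error introduced by the fixed scale, which is the accuracy loss the abstract refers to.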

  3. Electrostatic quadrupole array for focusing parallel beams of charged particles

    DOEpatents

    Brodowski, John (Smithtown, NY)

    1982-11-23

    An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

  4. Representing S-expressions for the efficient evaluation of Lisp on parallel processors

    SciTech Connect

    Harrison, W.L. III; Padua, D.A.

    1986-03-01

    Present methods for exploiting parallelism in Lisp programs perform poorly upon lists (long, flat s-expressions), as such structures must be both created and traversed sequentially. While such a serial operation may be masked by overlapping it with other computation (by virtue of process spawning, or by the use of a mechanism such as futures), it represents a lost (and potentially large) source of parallelism. In this paper we describe the representation of s-expressions employed in PARCEL (Project for the Automatic Restructuring and Concurrent Evaluation of Lisp), which facilitates the creation and access of lists, without compromising the performance of functions which manipulate s-expressions of a more general shape. Using this representation, the PARCEL compiler translates Lisp programs written in a subset of the Scheme dialect (which allows for global variables and atom properties) into code for a large, tightly coupled shared memory multiprocessor. 12 refs.

  5. Reconfigurable Parallel VLSI CoProcessor for Space Robot Using FPGA

    Microsoft Academic Search

    R. Wei; M. H. Jin; J. J. Xia; Z. W. Xie; Hong Liu

    2006-01-01

    This paper proposes hardware solutions to the computation for the trigonometric and square root functions of inverse kinematics. They are based on an existing pipeline arithmetic which employs the CORDIC (Coordinate Rotation Digital Computer) algorithm. This integrated approach enhances computational efficiency by reducing the duplicate calculations of these functions and maximizing the parallel/pipelining processing for real-time robot control. The reliability of
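
    As background on the CORDIC algorithm named above, here is a minimal software sketch of its rotation mode for sine and cosine. It uses floating point for readability, whereas hardware implementations typically use fixed-point shifts and adds. The iteration count and names are illustrative, and the square root computation mentioned in the abstract is not covered.

```python
import math

ITERATIONS = 32
# Elementary rotation angles atan(2^-i), precomputed as a lookup table.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Product of the per-step scale factors, applied once up front.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta: float):
    """Return (sin, cos) of theta, for theta within [-pi/2, pi/2]."""
    x, y, z = GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0            # rotate toward zero residual angle
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-6, abs(c - math.cos(0.5)) < 1e-6)  # True True
```

    Because each step needs only a shift, an add, and a table lookup, the iterations map naturally onto a hardware pipeline with one stage per iteration.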

  6. Design and Implementation of the MorphoSys Reconfigurable Computing Processor

    Microsoft Academic Search

    Ming-hau Lee; Hartej Singh; Guangming Lu; Nader Bagherzadeh; Fadi J. Kurdahi; Eliseu M. Chaves Filho; Vladimir Castro Alves

    2000-01-01

    In this paper, we describe the implementation of MorphoSys, a reconfigurable processing system targeted at data-parallel and computation-intensive applications. The MorphoSys architecture consists of a reconfigurable component (an array of reconfigurable cells) combined with a RISC control processor and a high bandwidth memory interface. We briefly discuss the system-level model, array architecture, and control processor. Next, we present the

  7. PostProcessor Development of a Hybrid TRR-XY Parallel Kinematic Machine Tool

    Microsoft Academic Search

    S.-L. Chen; T.-H. Chang; I. Inasaki; Y.-C. Liu

    2002-01-01

    A hybrid 5-degrees-of-freedom parallel kinematic machine tool constructed using the TRR-XY mechanism has been used to investigate the theory of post-processing. The effects of the cutter shapes and machine construction on the post-processing are investigated. Some specific parameters only are required to modify the post-processing for the different tools used in real cutting. The tilt angle and yaw angle of

  8. Transmissive Nanohole Arrays for Massively-Parallel Optical Yanan Wang,

    E-print Network

    Bao, Jiming

    -HEL) was used as a model analyte, giving a detection limit as low as 0.1 ng/mL. KEYWORDS: nanohole array channels to create portable lab-on-a-chip devi- ces.6,10,11,15 While EOT nanoholes are compact-transmission EOT-based nanohole techniques from being widely used in laboratories or clinical diagnostics. Because

  9. Array combination for parallel imaging in Magnetic Resonance Imaging 

    E-print Network

    Spence, Dan Kenrick

    2007-09-17

    In Magnetic Resonance Imaging, the time required to generate an image is proportional to the number of steps used to encode the spatial information. In rapid imaging, an array of coil elements and receivers is used to reduce the number of encoding...

  10. Memory Parallelism Using Custom Array Mapping to Heterogeneous Storage Structures

    Microsoft Academic Search

    Nastaran Baradaran; Pedro C. Diniz

    2006-01-01

    Configurable architectures offer the unique opportunity of customizing the storage allocation to meet specific applications' needs. In this paper we describe a compiler approach to map the arrays of a loop-based computation to internal memories of a configurable architecture with the objective of minimizing the overall execution time. We present an algorithm that considers the data access patterns of the

  11. Breast ultrasound tomography with two parallel transducer arrays: preliminary clinical results

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Intrator, Miranda; Hanson, Kenneth; Epstein, Katherine; Sandoval, Daniel; Williamson, Michael

    2015-03-01

    Ultrasound tomography has great potential to provide quantitative estimations of physical properties of breast tumors for accurate characterization of breast cancer. We design and manufacture a new synthetic-aperture breast ultrasound tomography system with two parallel transducer arrays. The distance between these two transducer arrays is adjustable for scanning breasts of different sizes. The ultrasound transducer arrays are translated vertically to scan the entire breast slice by slice and acquire ultrasound transmission and reflection data for whole-breast ultrasound imaging and tomographic reconstructions. We use the system to acquire patient data at the University of New Mexico Hospital for clinical studies. We present some preliminary imaging results of in vivo patient ultrasound data. Our preliminary clinical imaging results show the promise of our breast ultrasound tomography system with two parallel transducer arrays for breast cancer imaging and characterization.

  12. Series-Parallel Superconducting Quantum Interference Device Arrays Using High-TC Ion Damage Junctions

    NASA Astrophysics Data System (ADS)

    Wong, Travis; Mukhanov, Oleg

    2015-03-01

    We have fabricated several designs of three-junction series-parallel DC Superconducting Quantum Interference Device (BiSQUID) arrays in YBa2Cu3O7-x using 104 ion damage Josephson Junctions on a single 1 cm2 chip. A high aspect ratio ion implantation mask (30:1 ratio) with 30 nm slits was fabricated using electron beam lithography and low pressure reactive ion etching. Samples were irradiated with 60 keV helium ions to achieve a highly uniform damaged region throughout the thickness of the YBCO thin film as confirmed with Monte Carlo ion implantation simulations. Low frequency measurements of four different BiSQUID series-parallel SQUID array devices will be presented to investigate the effect of the BiSQUID design parameters on the linearity of the SQUID array in response to magnetic fields. BiSQUID arrays could provide a promising architecture for transimpedance amplifiers with improved linearity.

  13. Development of Microreactor Array Chip-Based Measurement System for Massively Parallel Analysis of Enzymatic Activity

    NASA Astrophysics Data System (ADS)

    Hosoi, Yosuke; Akagi, Takanori; Ichiki, Takanori

    Microarray chip technology such as DNA chips, peptide chips and protein chips is one of the promising approaches for achieving high-throughput screening (HTS) of biomolecule function since it has great advantages in feasibility of automated information processing due to one-to-one indexing between array position and molecular function as well as massively parallel sample analysis as a benefit of down-sizing and large-scale integration. Mostly, however, the function that can be evaluated by such microarray chips is limited to affinity of target molecules. In this paper, we propose a new HTS system of enzymatic activity based on microreactor array chip technology. A prototype of the automated and massively parallel measurement system for fluorometric assay of enzymatic reactions was developed by the combination of microreactor array chips and a highly-sensitive fluorescence microscope. Design strategy of microreactor array chips and an optical measurement platform for the high-throughput enzyme assay are discussed.

  14. Capanic: A Parallel Tree N-Body Code for Inhomogeneous Clusters of Processors

    E-print Network

    V. Antonuccio-Delogu; U. Becciani

    1994-06-24

    We have implemented a parallel version of the Barnes-Hut 3-D N-body tree algorithm under PVM 3.2.5, adopting an SPMD paradigm. We parallelize the problem by decomposing the physical domain by means of the Orthogonal Recursive Bisection oct-tree scheme suggested by Salmon (1991), but we modify the original hypercube communication pattern into an incomplete hypercube, which is more suitable for a generic inhomogeneous cluster architecture. We address dynamical load balancing by assigning different "weights" to the spawned tasks according to the dynamically changing workloads of each task. The weights are determined by monitoring the local platforms where the tasks are running and estimating the performance of each task. The monitoring scheme is flexible and allows us to address at the same time cluster and intrinsic sources of load imbalance. We then show measurements of the performance of our code on a test case of astrophysical interest in order to test the performance of our implementation.
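
    As a rough illustration of the weighted domain decomposition this abstract describes, the sketch below splits a set of particles by orthogonal recursive bisection so that each processor receives work in proportion to its relative speed. It is not the authors' PVM code: the function name `orb`, the `work` and `capacity` arguments, and the longest-axis splitting rule are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def orb(points: Sequence[Point], work: Sequence[float],
        capacity: Sequence[float]) -> List[List[int]]:
    """Weighted orthogonal recursive bisection (illustrative sketch).

    points   -- particle coordinates
    work     -- estimated cost of each particle (hypothetical weight)
    capacity -- relative speed of each processor (hypothetical weight)
    Returns one list of particle indices per processor.
    """
    def split(idx: List[int], caps: List[float]) -> List[List[int]]:
        if len(caps) == 1:
            return [idx]
        if not idx:
            return [[] for _ in caps]
        # Cut perpendicular to the axis along which the particles spread most.
        axis = max(range(3), key=lambda a: max(points[i][a] for i in idx)
                   - min(points[i][a] for i in idx))
        order = sorted(idx, key=lambda i: points[i][axis])
        half = len(caps) // 2
        left_caps, right_caps = caps[:half], caps[half:]
        # The left side should receive work in proportion to its capacity.
        target = sum(work[i] for i in order) * sum(left_caps) / sum(caps)
        acc, cut = 0.0, 0
        while cut < len(order) and acc + work[order[cut]] <= target + 1e-12:
            acc += work[order[cut]]
            cut += 1
        return split(order[:cut], left_caps) + split(order[cut:], right_caps)

    return split(list(range(len(points))), list(capacity))


# Eight particles on a line, unit cost each.
pts = [(float(i), 0.0, 0.0) for i in range(8)]
equal = orb(pts, [1.0] * 8, [1, 1, 1, 1])
skewed = orb(pts, [1.0] * 8, [3, 1])
```

    Re-running the split with updated `capacity` values is one simple way to mimic dynamic rebalancing on a cluster whose nodes change speed over time.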

  15. Extended Aperture 2-D Direction Finding With a Two-Parallel-Shape-Array Using Propagator Method

    Microsoft Academic Search

    Jin He; Zhong Liu

    2009-01-01

    In this letter, we propose a two-parallel-shape array geometry, consisting of sensors spaced much farther apart than a half-wavelength, to improve estimation accuracy via aperture extension for two-dimensional (2D) direction finding. First, the subarray parallel with the x-axis is employed to extract automatically paired high-variance but unambiguous y-axis direction cosines and low-variance but cyclically ambiguous x-axis direction cosines. Then, the

  16. Method of up-front load balancing for local memory parallel processors

    NASA Technical Reports Server (NTRS)

    Baffes, Paul Thomas (inventor)

    1990-01-01

    In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of sixty to seventy-five percent.

  17. Parallel RNA extraction using magnetic beads and a droplet array

    PubMed Central

    Shi, Xu; Chen, Chun-Hong; Gao, Weimin; Meldrum, Deirdre R.

    2015-01-01

    Nucleic acid extraction is a necessary step for most genomic/transcriptomic analyses, but it often requires complicated mechanisms to be integrated into a lab-on-a-chip device. Here, we present a simple, effective configuration for rapidly obtaining purified RNA from low concentration cell medium. This Total RNA Extraction Droplet Array (TREDA) utilizes an array of surface-adhering droplets to facilitate the transportation of magnetic purification beads seamlessly through individual buffer solutions without solid structures. The fabrication of TREDA chips is rapid and does not require a microfabrication facility or expertise. The process takes less than 5 minutes. When purifying mRNA from bulk marine diatom samples, its repeatability and extraction efficiency are comparable to conventional tube-based operations. We demonstrate that TREDA can extract the total mRNA of about 10 marine diatom cells, indicating that the sensitivity of TREDA approaches single-digit cell numbers. PMID:25519439

  18. A New FPGA/DSP-Based Parallel Architecture for Real-Time Image Processing

    Microsoft Academic Search

    Joan Batlle; Joan Martí; Pere Ridao; Josep Amat

    2002-01-01

    In this article, we present a new reconfigurable parallel architecture oriented to video-rate computer vision applications. This architecture is structured with a two-dimensional (2D) array of FPGA/DSP-based reprogrammable processors Pij. These processors are interconnected by means of a systolic 2D array of FPGA-based video-addressing units which allow video-rate links between any two processors in the net to overcome

  19. Graphics-processor-unit-based parallelization of optimized baseline wander filtering algorithms for long-term electrocardiography.

    PubMed

    Niederhauser, Thomas; Wyss-Balmer, Thomas; Haeberlin, Andreas; Marisa, Thanks; Wildhaber, Reto A; Goette, Josef; Jacomet, Marcel; Vogel, Rolf

    2015-06-01

    Long-term electrocardiogram (ECG) often suffers from relevant noise. Baseline wander in particular is pronounced in ECG recordings using dry or esophageal electrodes, which are dedicated for prolonged registration. While analog high-pass filters introduce phase distortions, reliable offline filtering of the baseline wander implies a computational burden that has to be put in relation to the increase in signal-to-baseline ratio (SBR). Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite, and infinite impulse response, moving mean, and moving median filter. Individual filter parameters were optimized with respect to the SBR increase based on ECGs from the Physionet database superimposed to autoregressive modeled, real baseline wander. A Monte-Carlo simulation showed that for low input SBR the moving median filter outperforms any other method but negatively affects ECG wave detection. In contrast, the infinite impulse response filter is preferred in case of high input SBR. However, the parallelized wavelet filter is processed 500 and four times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low SBR situations. Using a signal segment of 64 mega samples that is filtered as entire unit, wavelet filtering of a seven-day high-resolution ECG is computed within less than 3 s. Taking the high filtering speed into account, the GPU wavelet filter is the most efficient method to remove baseline wander present in long-term ECGs, with which computational burden can be strongly reduced. PMID:25675449
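
    To make the moving median filter named in this abstract concrete, here is a minimal sequential sketch: the baseline is estimated as the median over a sliding window and then subtracted from the signal. It is not the authors' GPU implementation; the function name, the odd `window` requirement, and the shrinking-window edge handling are assumptions chosen for the example.

```python
from statistics import median
from typing import List, Sequence, Tuple


def remove_baseline_moving_median(
        signal: Sequence[float], window: int) -> Tuple[List[float], List[float]]:
    """Estimate baseline wander with a moving median and subtract it.

    window -- odd number of samples in the sliding window (assumed parameter)
    Returns (filtered_signal, estimated_baseline).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    baseline = []
    for i in range(n):
        # Near the edges the window simply shrinks to the available samples.
        lo, hi = max(0, i - half), min(n, i + half + 1)
        baseline.append(median(signal[lo:hi]))
    filtered = [s - b for s, b in zip(signal, baseline)]
    return filtered, baseline


# A constant offset of 5.0 with a single narrow spike at index 5.
raw = [5.0] * 10
raw[5] = 6.0
clean, base = remove_baseline_moving_median(raw, window=5)
```

    Each output sample depends only on its own window, which is why filters of this kind lend themselves to the per-sample parallelization the abstract reports.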

  20. Massively Parallel Inner-Product Array Processor Roman Genov and Gert Cauwenberghs

    E-print Network

    Genov, Roman

    (n), and N x M matrix of stored elements W(m*n). In artificial neural networks, for instance, the matrix in a Support Vector Machine (SVM) [2]. Most of modern neural networks realizations contain a vector bandwidth needed for efficient real-time implementation. Multiprocessors and networked parallel computers

  1. Method for controlling propagation of data and transform through memory-linked wavefront array processor

    Microsoft Academic Search

    Dolecek

    1990-01-01

    This patent describes a method for controlling propagation of data and transforms through a linear array of multiple processing elements interspersed with linking dual port memories where each dual port memory can be accessed simultaneously and without contention by processing elements located on its left and right and where each processing element can be locally controlled by at least one

  2. Optical free-space interconnection in mesh-array processor architecture for beam forming radar processing

    Microsoft Academic Search

    Sylvain Paineau; Jean-Pierre Ghesquiers; Michel Charrier

    1995-01-01

    Optical interconnection is becoming very prevalent for computer architects, namely for massively parallel processing. This is increasingly accepted, but the optical technologies must fit the constraints of these architectures. In this paper we report technology-oriented research based on an architecture-oriented view: an SPMD (Single Program Multiple Data) type machine. We have demonstrated OFSI (Optical Free Space Interconnection) between

  3. Rapid Micro Array System for Passive Batch-Filling and Parallel-Printing Protein Solutions

    Microsoft Academic Search

    Cheng-En Ho; Chin-Chang Chieng; Min-Hung Chen; Fan-Gang Tseng

    2007-01-01

    This paper provides a novel micro contact printing system with batch-filling and parallel printing capability for rapid generation of protein arrays. The micro filling chip can simultaneously transfer tens to hundreds of protein solutions into the micro stamp chip in seconds by capillary force without cross-contamination, while maintaining the functionality of proteins before the application. Different proteins can be dispensed into

  4. Stream Processors

    NASA Astrophysics Data System (ADS)

    Erez, Mattan; Dally, William J.

    Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
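
    The gather-compute-scatter form mentioned in this abstract can be sketched in a few lines of ordinary software. The example below is only an analogy, assuming a Python list stands in for off-chip memory and a small local buffer stands in for on-chip storage; the function name `gather_compute_scatter` and the `block` size are hypothetical and do not correspond to any stream processor API.

```python
from typing import Callable, List, Sequence


def gather_compute_scatter(memory: List[float],
                           in_idx: Sequence[int],
                           out_idx: Sequence[int],
                           kernel: Callable[[float], float],
                           block: int = 4) -> List[float]:
    """Process memory in coarse-grained blocks: gather, compute, scatter."""
    for start in range(0, len(in_idx), block):
        # Gather: bulk-copy one block of operands into a local buffer
        # (the stand-in for software-managed on-chip memory).
        local = [memory[i] for i in in_idx[start:start + block]]
        # Compute: the kernel touches only the local buffer.
        results = [kernel(x) for x in local]
        # Scatter: write the whole block of results back in one phase.
        for j, r in zip(out_idx[start:start + block], results):
            memory[j] = r
    return memory


# Square the first eight words, read in reverse order, into the last eight.
mem = [float(v) for v in range(8)] + [0.0] * 8
gather_compute_scatter(mem, in_idx=list(range(7, -1, -1)),
                       out_idx=list(range(8, 16)),
                       kernel=lambda x: x * x)
```

    Because every memory access is listed up front in `in_idx` and `out_idx`, the transfers are explicit and predictable, which is the property the abstract credits for removing the need for reactive hardware such as caches.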

  5. Using Emulations to Enhance the Performance of Parallel Architectures

    Microsoft Academic Search

    Bojana Obrenic; Martin C. Herbordt; Arnold L. Rosenberg; Charles C. Weems

    1999-01-01

    We illustrate the potential of techniques and results from the theory of network emulations to enhance the performance of a parallel architecture. The vehicle for this demonstration is a suite of algorithms that endow an N-processor bit-serial processor array A with a "meta-instruction" GAUGE k, which (logically) reconfigures A into an N/k-processor virtual machine B_k that has: 1) a datapath

  6. Efficient Algorithms for Parallel Excitation and Parallel Imaging with Large Arrays 

    E-print Network

    Feng, Shuo

    2013-08-12

    in reconstructions. 2.3 Parallel Excitation The field strength of the current clinical scanners are advancing to 3 Tesla or even 7 Tesla which can tremendously improve the imaging quality. However, many high field related problems remain unsolved, for example...

  7. Mitigation of cache memory using an embedded hard-core PPC440 processor in a Virtex-5 Field Programmable Gate Array.

    SciTech Connect

    Learn, Mark Walter

    2010-02-01

    Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.

  8. A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm

    PubMed Central

    Guo, Xinyu; Wang, Hong; Devabhaktuni, Vijay

    2012-01-01

    A design of systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for Basic Local Alignment Search Tool (BLAST) Algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs that detect at most one hit in one clock cycle, our design applies a Multiple Hits Detection Module, which is a pipelined systolic array, to search multiple hits in a single clock cycle. Further, we designed a Hits Combination Block which combines overlapping hits from the systolic array into one hit. These implementations completed the first and second steps of the BLAST architecture and achieved significant speedup compared with previously published architectures. PMID:25969747

  9. A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm.

    PubMed

    Guo, Xinyu; Wang, Hong; Devabhaktuni, Vijay

    2012-01-01

    A design of systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for Basic Local Alignment Search Tool (BLAST) Algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs that detect at most one hit in one clock cycle, our design applies a Multiple Hits Detection Module, which is a pipelined systolic array, to search multiple hits in a single clock cycle. Further, we designed a Hits Combination Block which combines overlapping hits from the systolic array into one hit. These implementations completed the first and second steps of the BLAST architecture and achieved significant speedup compared with previously published architectures. PMID:25969747

  10. 1 ms column parallel vision system and its application of high speed target tracking

    Microsoft Academic Search

    Y. Nakabo; M. Ishikawa; H. Toyoda; S. Mizuno

    2000-01-01

    We have developed a 1 ms vision system, to provide a much faster frame rate than that of the conventional systems. Our 1 ms vision system has a 128×128 PD array and an all parallel processor array connected to each other in a column parallel architecture, so that the bottleneck of an image transfer is solved. 1 ms visual feedback

  11. Sequence information signal processor

    DOEpatents

    Peterson, John C. (Alta Loma, CA); Chow, Edward T. (San Dimas, CA); Waterman, Michael S. (Culver City, CA); Hunkapillar, Timothy J. (Pasadena, CA)

    1999-01-01

    An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
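
    The data flow in this abstract can be simulated in software. In the sketch below, processor j holds element a[j], the elements of b stream past one step at a time, and each processor combines its own previous score with the scores handed over by its left neighbour. The recurrence is a Smith-Waterman-style local alignment score chosen as an assumption for illustration; the patent's actual scoring rule, and the `match`, `mismatch`, and `gap` values, are not taken from the source.

```python
from typing import Optional, Sequence, Tuple


def systolic_local_alignment(
        a: Sequence, b: Sequence,
        match: int = 2, mismatch: int = -1, gap: int = 1
) -> Tuple[int, Optional[int], Optional[int]]:
    """Simulate a linear array of len(a) processors scoring a against b.

    Returns (best_score, processor_index, position_in_b).
    """
    m, n = len(a), len(b)
    out_t1 = [0] * m  # each processor's output one time step ago
    out_t2 = [0] * m  # each processor's output two time steps ago
    best: Tuple[int, Optional[int], Optional[int]] = (0, None, None)
    for t in range(m + n - 1):
        new = [0] * m
        for j in range(m):
            i = t - j  # element of b reaching processor j at time t
            if 0 <= i < n:
                diag = out_t2[j - 1] if j > 0 else 0  # score for (j-1, i-1)
                left = out_t1[j - 1] if j > 0 else 0  # score for (j-1, i)
                own = out_t1[j]                       # score for (j, i-1)
                s = match if a[j] == b[i] else mismatch
                new[j] = max(0, diag + s, left - gap, own - gap)
                if new[j] > best[0]:
                    best = (new[j], j, i)
        out_t2, out_t1 = out_t1, new
    return best


identical = systolic_local_alignment("ACGT", "ACGT")
embedded = systolic_local_alignment("GGACGTTT", "ACGT")
```

    All processors active at a given time step work on independent pairs of elements, so hardware can evaluate them simultaneously; the sequential inner loop here only stands in for that parallel step.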

  12. Experimental analysis of the phase dynamics in small parallel arrays of Josephson junctions

    SciTech Connect

    Gambardella, U.; Grimaldi, G.; Caputo, P.; Pace, S. [Istituto Nazionale per la Fisica della Materia, and Dipartimento di Fisica, Universita di Salerno, 84081SA Baronissi (Italy)]

    1997-10-01

    We analyze Fiske resonances of one-dimensional parallel arrays of underdamped Josephson tunnel junctions. They appear in the current-voltage (I-V) characteristics as resonant current singularities (steps) at finite voltages V_m when a magnetic field H is applied perpendicular to the array cells. We present measurements of current step amplitudes I_cm, and of the maximum Josephson current I_c0 as a function of H, for arrays made of four, six, and ten small Josephson junctions. The I-V characteristics of the arrays exhibit three, five, and eight resonant current steps, respectively, at increasing voltages. In all devices we find that the current amplitude of the highest order step has just one maximum occurring at H ≈ H*/2, where H* is the first field value at which I_c0(H*) ≈ I_c0(0). Numerical simulations of the phase dynamics in small parallel arrays as a function of the applied magnetic flux are performed. The results of the simulation reproduce the experimentally observed features. © 1997 American Institute of Physics.

  13. ReConfigurable Parallel Stream Processor with Self-Assembling and Self-Restorable MicroArchitecture

    Microsoft Academic Search

    Lev Kirischian; Irina Terterian; Pil Woo Chun; Vadim Geurkov

    2004-01-01

    In this paper we present a concept of the self-assembling micro-architectures of Application Specific Virtual Processors for data-stream processing. The procedure for micro-architecture assembling is developed for Xilinx \\

  14. Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management

    Microsoft Academic Search

    Pierre G. Paulin; Chuck Pilkington; Michel Langevin; Essaid Bensoudane; Gabriela Nicolescu

    2004-01-01

    In this paper, we describe the MultiFlex multi-processor SoC programming environment, with focus on two programming models: a distributed system object component (DSOC) message passing model, and a symmetrical multi-processing (SMP) model using shared memory. The MultiFlex tools map these models onto the StepNP multi-processor SoC platform, while making use of hardware accelerators for message passing and task scheduling. We

  15. Parallel SPM cantilever arrays for large area surface metrology and lithography

    NASA Astrophysics Data System (ADS)

    Gotszalk, Teodor; Ivanov, Tzvetan; Rangelow, Ivo W.

    2014-04-01

    In this paper, the technology of scanning probe microscopy (SPM) surface metrology using arrays of piezoresistive thermally actuated cantilevers is discussed. The cantilever architecture presented here makes it possible to image surface topography using sensors operating in parallel. In this way the throughput of the sample imaging is increased, which is of crucial importance in measurements of large area samples. Application of a piezoresistive detection scheme makes it possible to investigate quantitatively the interaction between the microprobe and the imaged surface. Integration of the thermal deflection actuator with the spring beam decreases the response time and enables fast and high resolution control of the tip-sample distance. The results of parallel topography measurement using a 1×4 cantilever array will be presented.

  16. A 32-Channel Lattice Transmission Line Array for Parallel Transmit and Receive MRI at 7 Tesla

    PubMed Central

    Adriany, Gregor; Auerbach, Edward J.; Snyder, Carl J.; Gözübüyük, Ark; Moeller, Steen; Ritter, Johannes; van de Moortele, Pierre-Francois; Vaughan, Tommy; Uğurbil, Kamil

    2010-01-01

    Transmit and receive RF coil arrays have proven to be particularly beneficial for ultra-high-field MR. Transmit coil arrays enable such techniques as B1+ shimming to substantially improve transmit B1 homogeneity compared to conventional volume coil designs, and receive coil arrays offer enhanced parallel imaging performance and SNR. Concentric coil arrangements hold promise for developing transceiver arrays incorporating large numbers of coil elements. At magnetic field strengths of 7 tesla and higher where the Larmor frequencies of interest can exceed 300 MHz, the coil array design must also overcome the problem of the coil conductor length approaching the RF wavelength. In this study, a novel concentric arrangement of resonance elements built from capacitively-shortened half-wavelength transmission lines is presented. This approach was utilized to construct an array with whole-brain coverage using 16 transceiver elements and 16 receive-only elements, resulting in a coil with a total of 16 transmit and 32 receive channels. PMID:20512850

  17. Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation.

    PubMed

    Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan

    2015-08-01

    Optical nanopore sensing offers great potential in single-molecule detection, genotyping, or DNA sequencing for high-throughput applications. However, one of the bottlenecks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores, which has large pore-to-pore distance, small variation in pore size and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bps) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching resulting in pore diameters down to ∼7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively-parallel single-molecule detection. PMID:26180050

  18. Stable parallel algorithms for computing and updating the QR decomposition

    Microsoft Academic Search

    E. J. Kontoghiorghes; M. R. B. Clarke

    1993-01-01

    We propose new stable parallel algorithms based on Householder transformations and compound Givens rotations to compute the QR decomposition of a rectangular matrix. The predicted execution times of all algorithms on the massively parallel SIMD array processor AMT DAP 510 have been obtained and analyzed. Modified versions of these algorithms are also considered for updating the QR decomposition, when rows

  19. Weak-periodic stochastic resonance in a parallel array of static nonlinearities.

    PubMed

    Ma, Yumei; Duan, Fabing; Chapeau-Blondeau, François; Abbott, Derek

    2013-01-01

    This paper studies the output-input signal-to-noise ratio (SNR) gain of an uncoupled parallel array of static, yet arbitrary, nonlinear elements for transmitting a weak periodic signal in additive white noise. In the small-signal limit, an explicit expression for the SNR gain is derived. It serves to prove that the SNR gain is always a monotonically increasing function of the array size for any given nonlinearity and noisy environment. It also determines the SNR gain maximized by the locally optimal nonlinearity as the upper bound of the SNR gain achieved by an array of static nonlinear elements. With locally optimal nonlinearity, it is demonstrated that stochastic resonance cannot occur, i.e. adding internal noise into the array never improves the SNR gain. However, in an array of suboptimal but easily implemented threshold nonlinearities, we show the feasibility of situations where stochastic resonance occurs, and also the possibility of the SNR gain exceeding unity for a wide range of input noise distributions. PMID:23505523

  20. Weak-Periodic Stochastic Resonance in a Parallel Array of Static Nonlinearities

    PubMed Central

    Ma, Yumei; Duan, Fabing; Chapeau-Blondeau, François; Abbott, Derek

    2013-01-01

    This paper studies the output-input signal-to-noise ratio (SNR) gain of an uncoupled parallel array of static, yet arbitrary, nonlinear elements for transmitting a weak periodic signal in additive white noise. In the small-signal limit, an explicit expression for the SNR gain is derived. It serves to prove that the SNR gain is always a monotonically increasing function of the array size for any given nonlinearity and noisy environment. It also determines the SNR gain maximized by the locally optimal nonlinearity as the upper bound of the SNR gain achieved by an array of static nonlinear elements. With locally optimal nonlinearity, it is demonstrated that stochastic resonance cannot occur, i.e. adding internal noise into the array never improves the SNR gain. However, in an array of suboptimal but easily implemented threshold nonlinearities, we show the feasibility of situations where stochastic resonance occurs, and also the possibility of the SNR gain exceeding unity for a wide range of input noise distributions. PMID:23505523

  1. Numerical Study of a Crossed Loop Coil Array for Parallel Magnetic Resonance Imaging

    SciTech Connect

    Hernandez, J.; Solis, S. E.; Rodriguez, A. O. [Centro de Investigacion e Instrumentacion e Imagenoloia Medica, Universidad Autonoma Metropolitana Iztapalapa, Mexico DF 09340 (Mexico)

    2008-08-11

    A coil design has been recently proposed by Temnikov (Instrum Exp Tech. 2005;48;636-637), with higher experimental signal-to-noise ratio than that of the birdcage coil. It is also claimed that it is possible to individually tune it with a single chip capacitor. This coil design closely resembles the gradiometer coil. These results motivated us to numerically simulate a three-coil array for parallel magnetic resonance imaging and in vivo magnetic resonance spectroscopy with multi nuclear capability. The magnetic field was numerically simulated by solving Maxwell's equations with the finite element method. Uniformity profiles were calculated at the midsection for one single coil and showed a good agreement with the experimental data. Then, two more coils were added to form two different coil arrays: coil elements were equally distributed at an angle of 30 deg. Then, uniformity profiles were calculated again for all cases at the midsection. Despite the strong interaction among all coil elements, very good field uniformity can be achieved. These numerical results indicate that this coil array may be a good choice for parallel magnetic resonance imaging.

  2. Parallel and series FED microstrip array with high efficiency and low cross polarization

    NASA Technical Reports Server (NTRS)

    Huang, John (inventor)

    1995-01-01

    A microstrip array antenna for vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications with a physical area of 1.7 m by 0.17 m comprises two rows of patch elements and employs a parallel feed to left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, are terminated with half-wavelength-long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

  3. Microfluidic Reactor Array Device for Massively Parallel In-situ Synthesis of Oligonucleotides

    PubMed Central

    Srivannavit, Onnop; Gulari, Mayurachat; Hua, Zhishan.; Gao, Xiaolian; Zhou, Xiaochuan; Hong, Ailing; Zhou, Tiecheng; Gulari, Erdogan

    2009-01-01

    We have designed and fabricated a microfluidic reactor array device for massively parallel in-situ synthesis of oligonucleotides (oDNA). The device is made of glass anodically bonded to silicon and consists of three levels of features: microreactors, microchannels and through inlet/outlet holes. The main challenges in the design of this device include preventing diffusion of photogenerated reagents upon activation and achieving uniform reagent flow through thousands of parallel reactors. The device embodies a simple and effective dynamic isolation mechanism which prevents the intermixing of active reagents between discrete microreactors. With proper design of the microreactors and the microchannels, it is possible to achieve uniform flow and synthesis reaction in all of the reactors. We demonstrated the use of this device on a solution-based, light-directed parallel in-situ oDNA synthesis. We were able to synthesize long oDNA, up to 120 mers at a stepwise yield of 98%. The quality of our microfluidic oDNA microarray, including sensitivity, signal noise, specificity, spot variation and accuracy, was characterized. Our microfluidic reactor array devices show great potential for genomics and proteomics research. PMID:20161215

  4. Norway's ERS-1 SAR Processor

    Microsoft Academic Search

    Sverre Holm; A. Maoy; E.-A. Herland

    1990-01-01

    A high-performance processing facility for Synthetic Aperture Radar (SAR) is described. The SAR processor is designed for the ERS-1 remote sensing satellite and will process one 100 km by 100 km scene in six to seven minutes. The SAR processor is built around a 320 MFLOPS parallel processor. The front-end processor is a mini-computer which provides the input/output capacity necessary

  5. Parallel FDTD Modeling of a Focal Plane Array with Vivaldi Elements on the Highly Parallel LOFAR BlueGene/L Supercomputer

    Microsoft Academic Search

    R. Maaskant; M. V. Ivashina; R. Mittra; W. Yu; N.-T. Huang

    2006-01-01

    This paper describes some preliminary results of a numerical study towards the modeling of a focal plane array (FPA) with Vivaldi elements, carried out on the LOFAR BlueGene/L, a supercomputer with more than 6000 nodes located at the University of Groningen. The parallel finite-difference time domain (PFDTD) code was used to simulate the array because it is able to achieve

  6. Design and implementation of a parallel array operator for the arbitrary remapping of data.

    SciTech Connect

    Dietz, Steven; Choi, S. E. (Sung-Eun); Chamberlain, B. L. (Bradford L.); Snyder, Lawrence

    2003-01-01

    The data redistribution or remapping functions, gather and scatter, are of long standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations (run length encoding, dead array reuse, and direct communication) that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate comparable performance to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.
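
    As a rough illustration of the two remapping functions named in this abstract, the following serial Python sketch shows what gather and scatter compute. The function names and the last-write-wins collision rule are assumptions made for this example; it does not reproduce the ZPL operator or its parallel optimizations.

```python
def gather(src, idx):
    """Gather: out[i] = src[idx[i]] (read through an index array)."""
    return [src[j] for j in idx]


def scatter(src, idx, size, fill=0):
    """Scatter: out[idx[i]] = src[i] (write through an index array).

    Assumption for this sketch: if two sources target the same slot,
    the later one wins.
    """
    out = [fill] * size
    for value, j in zip(src, idx):
        out[j] = value
    return out


data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]

gathered = gather(data, perm)                    # reorder data through perm
restored = scatter(gathered, perm, len(data))    # scatter with the same index array undoes it
```

    When the index array is a permutation, scattering with it inverts the gather, which is why the two are usually presented as a pair.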

  7. Nanopore arrays in a silicon membrane for parallel single-molecule detection: fabrication.

    PubMed

    Schmidt, Torsten; Zhang, Miao; Sychugov, Ilya; Roxhed, Niclas; Linnros, Jan

    2015-08-01

    Solid state nanopores enable translocation and detection of single bio-molecules such as DNA in buffer solutions. Here, sub-10 nm nanopore arrays in silicon membranes were fabricated by using electron-beam lithography to define etch pits and by using a subsequent electrochemical etching step. This approach effectively decouples positioning of the pores and the control of their size, where the pore size essentially results from the anodizing current and time in the etching cell. Nanopores with diameters as small as 7 nm, fully penetrating 300 nm thick membranes, were obtained. The presented fabrication scheme to form large arrays of nanopores is attractive for parallel bio-molecule sensing and DNA sequencing using optical techniques. In particular the signal-to-noise ratio is improved compared to other alternatives such as nitride membranes suffering from a high-luminescence background. PMID:26180043

  8. EO pumping through a periodic array of parallel slats: effect of porosity

    NASA Astrophysics Data System (ADS)

    Kung, Chun-Fei; Wang, Chang-Yi; Chang, Chien-Cheng

    2011-11-01

    A periodic array of parallel slats is considered either as a model for porous medium/channel or as a means for electro-osmotic (EO) pumping. Four length scales are involved: the vertical period L, lateral period aL, the width of the slat cL as well as the Debye length λD of the electric double layer (EDL). The purpose of this study is to examine the efficiency of EO pumping in terms of the flow rate Q versus the normalized lengths: a, c and K = L/λD. The method of series expansions with boundary collocation is performed for both longitudinal and transverse electro-osmotic pumping (LEOP and TEOP). The main findings include (1) The pumping rate Q of LEOP for an array with large porosity (c/a = 0.3) is always larger than that of the array with full slats (without porosity, c/a = 1). (2) At small K (<1.5), the behavior of Q of TEOP for large porosity (c/a = 0.1) is similar to that for LEOP, but the effect in improving the pumping rate is less significant. On the contrary, at large K (>3), the effect is the opposite: the pumping rate Q with large porosity is smaller than that of the full slats, especially when the period of the array is short (a is close to 1).

  9. Performance of the UCAN2 Gyrokinetic Particle In Cell (PIC) Code on Two Massively Parallel Mainframes with Intel ``Sandy Bridge'' Processors

    NASA Astrophysics Data System (ADS)

    Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul

    2013-10-01

    The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers, one at NERSC in the US built by Cray named Edison and the other at the Barcelona Supercomputer Center (BSC) in Spain built by IBM named MareNostrum III (MNIII) just happen to share the same Intel ``Sandy Bridge'' processors. The successful port of UCAN2 to MNIII which came online first has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.

  10. Optoelectronic parallel processing with smart pixel arrays for automated screening of cervical smear imagery

    NASA Astrophysics Data System (ADS)

    Metz, John Langdon

    2000-10-01

    This thesis investigates the use of optoelectronic parallel processing systems with smart photosensor arrays (SPAs) to examine cervical smear images. The automation of cervical smear screening seeks to reduce human workload and improve the accuracy of detecting pre-cancerous and cancerous conditions. Increasing the parallelism of image processing improves the speed and accuracy of locating regions-of-interest (ROI) from images of the cervical smear for the first stage of a two-stage screening system. The two-stage approach first detects ROI optoelectronically before classifying them using more time consuming electronic algorithms. The optoelectronic hit/miss transform (HMT) is computed using gray scale modulation spatial light modulators in an optical correlator. To further the parallelism of this system, a novel CMOS SPA computes the post processing steps required by the HMT algorithm. The SPA reduces the subsequent bandwidth passed into the second, electronic image processing stage classifying the detected ROI. Limitations in the miss operation of the HMT suggest using only the hit operation for detecting ROI. This makes possible a single SPA chip approach using only the hit operation for ROI detection which may replace the optoelectronic correlator in the screening system. Both the HMT SPA postprocessor and the SPA ROI detector design provide compact, efficient, and low-cost optoelectronic solutions to performing ROI detection on cervical smears. Analysis of optoelectronic ROI detection with electronic ROI classification shows these systems have the potential to perform at, or above, the current error rates for manual classification of cervical smears.
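
    For readers unfamiliar with the hit/miss transform mentioned above, the sketch below shows its textbook binary form in plain Python: a pixel is kept only where a "hit" template fits the foreground and a "miss" template fits the background. This is a serial, binary illustration under assumed names (`erode`, `hit_miss`) and an assumed toy image; the thesis computes the transform optically with gray-scale modulation, which is not modeled here.

```python
def erode(image, offsets):
    """Binary erosion: 1 where every offset lands on a 1 inside the image."""
    rows, cols = len(image), len(image[0])

    def fits(r, c):
        return all(
            0 <= r + dr < rows and 0 <= c + dc < cols and image[r + dr][c + dc] == 1
            for dr, dc in offsets
        )

    return [[1 if fits(r, c) else 0 for c in range(cols)] for r in range(rows)]


def hit_miss(image, hit, miss):
    """Hit/miss transform: erosion of the image by `hit` AND erosion of its complement by `miss`."""
    complement = [[1 - v for v in row] for row in image]
    hits = erode(image, hit)
    misses = erode(complement, miss)
    return [[h & m for h, m in zip(hr, mr)] for hr, mr in zip(hits, misses)]


# Toy example: find isolated foreground pixels (no 4-connected neighbours).
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
hit = [(0, 0)]                               # the pixel itself must be foreground
miss = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # its 4 neighbours must be background

result = hit_miss(image, hit, miss)
detected = [(r, c) for r, row in enumerate(result) for c, v in enumerate(row) if v]
```

    Dropping the `miss` term reduces the operation to a plain erosion, which corresponds to the hit-only variant the abstract describes for the single-chip detector.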

  11. Computation and parallel implementation for early vision

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    1990-01-01

    The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) to image representations built out of primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

  12. A micromachined silicon parallel acoustic delay line (PADL) array for real-time photoacoustic tomography (PAT)

    NASA Astrophysics Data System (ADS)

    Cho, Young Y.; Chang, Cheng-Chung; Wang, Lihong V.; Zou, Jun

    2015-03-01

    To achieve real-time photoacoustic tomography (PAT), massive transducer arrays and data acquisition (DAQ) electronics are needed to receive the PA signals simultaneously, which results in complex and high-cost ultrasound receiver systems. To address this issue, we have developed a new PA data acquisition approach using acoustic time delay. Optical fibers were used as parallel acoustic delay lines (PADLs) to create different time delays in multiple channels of PA signals. This makes the PA signals reach a single-element transducer at different times. As a result, they can be properly received by single-channel DAQ electronics. However, due to their small diameter and fragility, using optical fibers as acoustic delay lines poses a number of challenges in the design, construction and packaging of the PADLs, thereby limiting their performance and use in real imaging applications. In this paper, we report the development of new silicon PADLs, which are directly made from silicon wafers using advanced micromachining technologies. The silicon PADLs have very low acoustic attenuation and distortion. A linear array of 16 silicon PADLs was assembled into a handheld package with one common input port and one common output port. To demonstrate its real-time PAT capability, the silicon PADL array (with its output port interfaced with a single-element transducer) was used to receive 16 channels of PA signals simultaneously from a tissue-mimicking optical phantom sample. The reconstructed PA image matches well with the imaging target. Therefore, the silicon PADL array can provide a 16× reduction in the ultrasound DAQ channels for real-time PAT.

  13. New computing environments: Parallel, vector and systolic

    SciTech Connect

    Wouk, A.

    1986-01-01

    This book presents papers on supercomputers and array processors. Topics considered include nested dissection, the systolic level 2 BLAS, parallel processing a hydrodynamic shock wave problem, MACH-1, portable standard LISP on the Cray, distributed combinator evaluation, performance and library issues, scale problems, multiprocessor architecture, the MIDAS multiprocessor system, parallel algorithms for incompressible and compressible flows on a multiprocessor, and parallel algorithms for elliptic equations.

  14. A Fast Parallel Implementation of the Berlekamp-Massey Algorithm with a One-D Systolic Array Architecture

    Microsoft Academic Search

    Shojiro Sakata; Masazumi Kurihara

    1995-01-01

    In this paper we present a fast parallel version of the BM algorithm based on a one-dimensional (1D) or linear systolic array architecture which is composed of a series of m cells (processing units), where m is the size of the given data, i.e., the length of the input sequence. The 1D systolic array has only local communication links between
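
    For context, the sketch below is the standard serial Berlekamp-Massey algorithm over GF(2), which finds the shortest linear feedback shift register generating a given bit sequence. It is not the systolic-array version proposed in the paper; the function name and the test sequence are assumptions chosen for this illustration.

```python
def berlekamp_massey_gf2(s):
    """Return (L, c) where L is the linear complexity of bit list s and
    c = [1, c1, ..., cL] is the connection polynomial, so that
    s[i] = c1*s[i-1] ^ ... ^ cL*s[i-L] for all i >= L."""
    n = len(s)
    c = [0] * (n + 1)   # current connection polynomial
    b = [0] * (n + 1)   # copy saved at the last length change
    c[0] = b[0] = 1
    L, m = 0, -1        # current LFSR length, index of last length change
    for i in range(n):
        # Discrepancy between s[i] and the value predicted by the current LFSR.
        d = s[i]
        for j in range(1, L + 1):
            d ^= c[j] & s[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return L, c[:L + 1]


# Bits generated by the recurrence s[i] = s[i-1] ^ s[i-3] (an assumed test input).
seq = [1, 0, 0]
while len(seq) < 8:
    seq.append(seq[-1] ^ seq[-3])

length, poly = berlekamp_massey_gf2(seq)
```

    Each iteration depends on the polynomial produced by the previous one, which is the sequential dependency that array formulations of the algorithm have to schedule around.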

  15. A Parallel/Series Array of Cold-Electron Bolometers with SIN Tunnel Junctions for Cosmology Instruments

    Microsoft Academic Search

    Leonid Kuzmin

    A novel concept of the parallel/series array of Cold-Electron Bolometers (CEB) with Superconductor-Insulator-Normal (SIN) Tunnel Junctions has been proposed for matching with JFET readout. The current-biased CEBs are connected in series for DC and in parallel for HF signal. A signal is concentrated to the absorber through the capacitance of tunnel junctions and additional capacitance for coupling of superconducting islands.

  16. Modeling of the phase lag causing fluidelastic instability in a parallel triangular tube array

    NASA Astrophysics Data System (ADS)

    Khalifa, Ahmed; Weaver, David; Ziada, Samir

    2013-11-01

    Fluidelastic instability is considered a critical flow induced vibration mechanism in tube and shell heat exchangers. It is believed that a finite time lag between tube vibration and fluid response is essential to predict the phenomenon. However, the physical nature of this time lag is not fully understood. This paper presents a fundamental study of this time delay using a parallel triangular tube array with a pitch ratio of 1.54. A computational fluid dynamics (CFD) model was developed and validated experimentally in an attempt to investigate the interaction between tube vibrations and flow perturbations at lower reduced velocities Ur=1-6 and Reynolds numbers Re=2000-12 000. The numerical predictions of the phase lag are in reasonable agreement with the experimental measurements for the range of reduced velocities Ug/fd=6-7. It was found that there are two propagation mechanisms; the first is associated with the acoustic wave propagation at low reduced velocities, Ur<2, and the second mechanism for higher reduced velocities is associated with the vorticity shedding and convection. An empirical model of the two mechanisms is developed and the phase lag predictions are in reasonable agreement with the experimental and numerical measurements. The developed phase lag model is then coupled with the semi-analytical model of Lever and Weaver to predict the fluidelastic stability threshold. Improved predictions of the stability boundaries for the parallel triangular array were achieved. In addition, the present study has explained why fluidelastic instability does not occur below some threshold reduced velocity.

  17. A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically reconfigurable processor

    E-print Network

    Arslan, Tughrul

    A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically detection (LSD) multi-input multi-output (MIMO) decoder based on a recently developed novel Reconfigurable and mapped onto our proposed platform. We discuss the targeted K-best LSD algorithm as well as the sorting

  18. Femtosecond laser fabrication of micro/nano-channel array devices for parallelized fluorescence detection

    NASA Astrophysics Data System (ADS)

    Canfield, Brian; Hofmeister, William; Davis, Lloyd

    2013-03-01

    Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. Ultrasensitive, highly parallelized fluorescence-based platforms that incorporate a nano/micro-fluidic chip with an array of closely spaced channels would meet this need. We discuss the use of direct femtosecond laser machining to fabricate prototype fluidic chips with arrays of more than one hundred closely spaced channels. Traditional machining techniques involve overlapping focal spots from many laser pulses while scanning the substrate in order to create channels. However, this procedure is not only lengthy but may allow thermal effects to accumulate that degrade the quality of both the channel profile and surrounding substrate material. We are developing a different method for machining a line with just a single pulse, using a combination of cylindrical lenses and an aspheric lens to reshape a near-Gaussian beam into a tight line focus. Channels on the order of 1 micron wide, 5 microns deep, and nearly 2000 microns long may be made this way. We also address the critical issue of mitigating the high autofluorescence responses that arise from the creation of defects by fs-laser machining in fused silica.

  19. Static properties and current steps in one-dimensional parallel arrays of Josephson tunnel junctions in the presence of a magnetic field

    Microsoft Academic Search

    U. Gambardella; P. Caputo; V. Boffa; G. Celentano; G. Costabile; S. Pace

    1996-01-01

    Static and dynamical states in one-dimensional parallel arrays of small, all refractory Josephson tunnel junctions, are experimentally investigated at 4.2 K. The dependence of the maximum Josephson current on the applied magnetic field in parallel arrays shows asymmetric behavior around secondary maxima which are discussed in terms of a nonuniform applied magnetic field. Current steps and linear branches induced by

  20. Pressure drop and heat transfer of arrays of in-line circular blocks on the wall of parallel channel

    Microsoft Academic Search

    Tamotsu Igarashi; Hajime Nakamura; Taketo Fukuoka

    2004-01-01

    Pressure drop and heat transfer of arrays of in-line circular blocks on the wall of a parallel channel are measured. Diameter and height of the blocks are 40 and 18 mm, respectively, while pitches of the blocks are varied. The effects of the number of lines and rows and other factors on pressure drop and heat transfer are investigated. The

  1. 600 GHz resonant mode in a parallel array of Josephson tunnel junctions connected by superconducting microstrip lines

    Microsoft Academic Search

    Vsevolod K. Kaplunenko; Britt H. Larsen; Jesper Mygind; Niels F. Pedersen

    1995-01-01

    The high frequency properties of the one-dimensional transmission line consisting of a parallel array of resistively shunted Josephson tunnel junctions have been studied in the limit of relatively low damping where this nonlinear system exhibits new and interesting phenomena. Here we report on experimental investigations of a resonant step observed at a voltage corresponding to 600 GHz in the dc

  2. 600 GHz resonant mode in a parallel array of Josephson tunnel junctions connected by superconducting microstrip lines

    Microsoft Academic Search

    V. K. Kaplunenko; Britt H. Larsen; J. Mygind; N. F. Pedersen

    1994-01-01

    The high frequency properties of the one-dimensional transmission line consisting of a parallel array of resistively shunted Josephson tunnel junctions have been studied in the limit of relatively low damping where this nonlinear system exhibits new and interesting phenomena. Here we report on experimental and numerical investigations of a resonant step observed at a voltage corresponding to 600 GHz in

  3. FPGA-Based Coprocessor for Singular Value Array Reconciliation Tomography

    Microsoft Academic Search

    Jack Coyne; David Cyganski; R. James Duckworth

    2008-01-01

    We present an FPGA-based co-processor for accelerating computations associated with Singular Value Array Reconciliation Tomography (SART), a recently developed method for RF source localization. The co-processor allows this relatively complex computational task to be performed using less hardware and less power than would be required by a microprocessor-based computing cluster with comparable throughput and accuracy. The architecture exploits parallelism of

  4. Algorithm Design for Multicore Processors

    E-print Network

    Rand, David

    Algorithm Design for Multicore Processors . . . a high-level approach Peter Krusche Alexander Our Motivation We start with a simple model for designing parallel algorithms. Question 1: Are these algorithms suitable for multicore processors? Question 2: Can our simple algorithms be realized efficiently

  5. A Speed-Optimized Systolic Array Processor Architecture for Spatio-Temporal 2-D IIR Broadband Beam Filters

    Microsoft Academic Search

    H. L. P. Arjuna Madanayake; Leonard T. Bruton

    2008-01-01

    For high-speed plane-wave filtering applications, real-time 2-D spatio-temporal linear-array broadband beam filters are required, operating at temporal frame rates in excess of hundreds of megahertz. The corresponding application specific VLSI circuits must have low critical-path latencies. A novel high-speed systolic array architecture for a first-order 2-D broadband frequency-planar spatio-temporal beam filter is proposed for this purpose and employs a field-programmable

  6. PDDP: A data parallel programming model. Revision 1

    SciTech Connect

    Warren, K.H.

    1995-06-01

    PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared-memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.
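
    To make the idea of a distributed data object in a global name space concrete, here is a small Python sketch of an HPF-style BLOCK distribution, in which a global array of n elements is cut into contiguous blocks of ceil(n/p) elements, one per processor. The helper names are hypothetical, and the abstract does not say how PDDP lays out data internally, so this is only an illustration of the general technique.

```python
def block_size(n, p):
    """Elements per processor under a BLOCK distribution: ceil(n / p)."""
    return -(-n // p)


def block_owner(i, n, p):
    """Rank of the processor that owns global index i (0-based)."""
    return i // block_size(n, p)


def local_range(rank, n, p):
    """Half-open range [lo, hi) of global indices stored on `rank`."""
    b = block_size(n, p)
    lo = min(rank * b, n)
    hi = min(lo + b, n)
    return lo, hi


n, p = 10, 4   # assumed sizes for the example
ranges = [local_range(rank, n, p) for rank in range(p)]
owners = [block_owner(i, n, p) for i in range(n)]
```

    With these sizes the last processor holds a shorter block, which is the usual consequence of rounding the block size up.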

  7. Marching Pixels - Using Organic Computing Principles in Embedded Parallel Hardware

    Microsoft Academic Search

    Marcus Komann; Dietmar Fey

    2006-01-01

    We present an organic computing approach for very fast image processing, which we call marching pixels (MPs). Using an embedded massively-parallel array of processor elements (PEs) MPs exploit emergent algorithms in order to solve difficult tasks. They are data packets residing in specific PEs and can be seen as virtual organisms, which are born, move, unite, are mutated, leave signatures

  8. Development of micropump-actuated negative pressure pinched injection for parallel electrophoresis on array microfluidic chip.

    PubMed

    Li, Bowei; Jiang, Lei; Xie, Hua; Gao, Yan; Qin, Jianhua; Lin, Bingcheng

    2009-09-01

    A micropump-actuated negative pressure pinched injection method is developed for parallel electrophoresis on a multi-channel LIF detection system. The system has a home-made device that can individually control 16-port solenoid valves and a high-voltage power supply. The excitation laser beam is distributed to the array of separation channels for detection. The hybrid glass-PDMS microfluidic chip comprises two common reservoirs, four separation channels coupled to their respective pneumatic micropumps and two reference channels. Because pressure is used as the driving force, the proposed method has no sample bias effect for separation. Only one high-voltage supply is needed for separation regardless of the number of channels, which is significant for high-throughput analysis, and the time for sample loading is shortened to 1 s. In addition, the integrated micropumps can provide a versatile interface for coupling with other functional units to satisfy complicated demands. The performance is verified by separation of DNA marker and Hepatitis B virus DNA samples. This method is also expected to show potential for high-throughput DNA analysis in the field of disease diagnosis. PMID:19681052

  9. 480-GMACS/mW Resonant Adiabatic Mixed-Signal Processor Array for Charge-Based Pattern Recognition

    Microsoft Academic Search

    Rafal Karakiewicz; Roman Genov; Gert Cauwenberghs

    2007-01-01

    A resonant adiabatic mixed-signal VLSI array delivers 480 GMACS (10^9 multiply-and-accumulates per second) throughput for every mW of power, a 25-fold improvement over the energy efficiency obtained when resonant clock generator and line drivers are replaced with static CMOS drivers. Losses in resonant clock generation are minimized by activating switches between the LC tank and DC supply with a periodic
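
    The efficiency figure above can be restated as energy per operation. The short Python calculation below uses only the two numbers given in the abstract (480 GMACS per mW and the 25-fold improvement); the variable names are chosen for this example, and the static-CMOS figures are derived from that ratio rather than reported values.

```python
throughput_per_mw = 480e9   # multiply-accumulates per second delivered per mW (from the abstract)
improvement = 25            # stated gain over the static CMOS driver configuration

# 1 mW = 1e-3 J/s, so energy per MAC is power divided by throughput.
energy_per_mac_j = 1e-3 / throughput_per_mw

# Implied figures for the static CMOS configuration (derived, not reported).
static_per_mw = throughput_per_mw / improvement
static_energy_per_mac_j = 1e-3 / static_per_mw

print(f"resonant:    {energy_per_mac_j * 1e15:.2f} fJ per MAC")
print(f"static CMOS: {static_energy_per_mac_j * 1e15:.1f} fJ per MAC "
      f"({static_per_mw / 1e9:.1f} GMACS/mW)")
```

    In other words, the quoted throughput corresponds to roughly 2 fJ per multiply-accumulate, against roughly 52 fJ for the implied static CMOS baseline.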

  10. Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment

    NASA Astrophysics Data System (ADS)

    Cao, Z.; Walsh, J. L.; Kong, M. G.

    2009-01-01

    This letter reports on electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates including surgical tissue forceps and sloped plastic plate of up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than a comparable single jet when both are used to treat a 15° sloped substrate. These benefits are likely from an effective self-adjustment mechanism among individual jets facilitated by individualized ballast and spatial redistribution of surface charges.

  11. The Indirect Binary n-Cube Microprocessor Array

    Microsoft Academic Search

    Marshall C. Pease III

    1977-01-01

    This paper explores the possibility of using a large-scale array of microprocessors as a computational facility for the execution of massive numerical computations with a high degree of parallelism. By microprocessor we mean a processor realized on one or a few semiconductor chips that include arithmetic and logical facilities and some memory. The current state of LSI technology makes this

  12. Parallel detection of harmful algae using reverse transcription polymerase chain reaction labeling coupled with membrane-based DNA array.

    PubMed

    Zhang, Chunyun; Chen, Guofu; Ma, Chaoshuai; Wang, Yuanyuan; Zhang, Baoyu; Wang, Guangce

    2014-03-01

    Harmful algal blooms (HABs) are a global problem, which can cause economic loss to the aquaculture industry and pose a potential threat to human health. More attention must be paid to the development of effective detection methods for the causative microalgae. Traditional microscopic examination has many disadvantages, such as low efficiency, inaccuracy, and the need for specialized identification skills, and it is especially unsuited to parallel analysis of several morphologically similar microalgae to species level at one time. This study aimed at exploring the feasibility of using membrane-based DNA array for parallel detection of several microalgae by selecting five microalgae, including Heterosigma akashiwo, Chaetoceros debilis, Skeletonema costatum, Prorocentrum donghaiense, and Nitzschia closterium as test species. Five species-specific (taxonomic) probes were designed from variable regions of the large subunit ribosomal DNA (LSU rDNA) by visualizing the alignment of LSU rDNA of related species. The specificity of the probes was confirmed by dot blot hybridization. The membrane-based DNA array was prepared by spotting the tailed taxonomic probes onto positively charged nylon membrane. Digoxigenin (Dig) labeling of target molecules was performed by multiple PCR/RT-PCR using RNA/DNA mixture of five microalgae as template. The Dig-labeled amplification products were hybridized with the membrane-based DNA array to produce visible hybridization signal indicating the presence of target algae. Detection sensitivity comparison showed that RT-PCR labeling (RPL) coupled with hybridization was tenfold more sensitive than DNA-PCR labeling coupled with hybridization. Finally, the effectiveness of RPL coupled with membrane-based DNA array was validated by testing with simulated and natural water samples, respectively. 
All of these results indicated that RPL coupled with membrane-based DNA array is specific, simple, and sensitive for parallel detection of microalgae which shows promise for monitoring natural samples in the future. PMID:24338073

  13. The Milstar Advanced Processor

    NASA Astrophysics Data System (ADS)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  14. An associative processor for air traffic control

    Microsoft Academic Search

    Kenneth James Thurber

    1971-01-01

    In recent years associative memories have been receiving an increasing amount of attention. At the same time multiprocessor and parallel processing systems have been under study to solve very large problems. An associative processor is one form of a parallel processor that seems able to provide a cost effective solution to many problems such as the air traffic control (ATC)

  15. Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging

    PubMed Central

    Pang, Yong; Yu, Baiying; Vigneron, Daniel B.

    2014-01-01

    Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using a single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than –35 dB for all the elements and the RF fields are homogeneous with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure by using an 8-element quadrature planar patch array to demonstrate its feasibility in parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

  16. Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging.

    PubMed

    Pang, Yong; Yu, Baiying; Vigneron, Daniel B; Zhang, Xiaoliang

    2014-02-01

    Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using a single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than -35 dB for all the elements and the RF fields are homogeneous with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure by using an 8-element quadrature planar patch array to demonstrate its feasibility in parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

  17. Parallel multispot smFRET analysis using an 8-pixel SPAD array

    PubMed Central

    Ingargiola, A.; Colyer, R. A.; Kim, D.; Panzeri, F.; Lin, R.; Gulinatti, A.; Rech, I.; Ghioni, M.; Weiss, S.; Michalet, X.

    2012-01-01

    Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for extracting distance information between two fluorophores (a donor and acceptor dye) on a nanometer scale. This method is commonly used to monitor binding interactions or intra- and intermolecular conformations in biomolecules freely diffusing through a focal volume or immobilized on a surface. The diffusing geometry has the advantage to not interfere with the molecules and to give access to fast time scales. However, separating photon bursts from individual molecules requires low sample concentrations. This results in long acquisition time (several minutes to an hour) to obtain sufficient statistics. It also prevents studying dynamic phenomena happening on time scales larger than the burst duration and smaller than the acquisition time. Parallelization of acquisition overcomes this limit by increasing the acquisition rate using the same low concentrations required for individual molecule burst identification. In this work we present a new two-color smFRET approach using multispot excitation and detection. The donor excitation pattern is composed of 4 spots arranged in a linear pattern. The fluorescent emission of donor and acceptor dyes is then collected and refocused on two separate areas of a custom 8-pixel SPAD array. We report smFRET measurements performed on various DNA samples synthesized with various distances between the donor and acceptor fluorophores. We demonstrate that our approach provides identical FRET efficiency values to a conventional single-spot acquisition approach, but with a reduced acquisition time. Our work thus opens the way to high-throughput smFRET analysis on freely diffusing molecules. PMID:24382989

  18. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications

    Microsoft Academic Search

    Hartej Singh; Ming-hau Lee; Guangming Lu; Fadi J. Kurdahi; Nader Bagherzadeh; Eliseu M. Chaves Filho

    2000-01-01

    This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. MorphoSys is a coarse-grain, integrated, and reconfigurable system-on-chip, targeted at high-throughput and data-parallel applications. It is comprised of a reconfigurable array of processing cells, a modified RISC processor core, and an efficient memory interface unit. This

  19. Top-down designs of instruction systolic arrays for polynomial interpolation and evaluation

    SciTech Connect

    Schroder, H. (Australian National Univ., Canberra (Australia))

    1989-06-01

    This paper describes the application of a new parallel architecture, the instruction systolic array (ISA), for the interpolation and evaluation of polynomials using a linear array of processors. It also demonstrates a systemic top-down design of instruction systolic arrays. The periods of the resulting algorithms are O(n) for interpolation and O(1) for evaluation, where n is the degree of the polynomial.

  20. Signal processor packaging design

    NASA Astrophysics Data System (ADS)

    McCarley, Paul L.; Phipps, Mickie A.

    1993-10-01

    The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

  1. A Study of Ultrasonic Wave Propagation Through Parallel Arrays of Immersed Tubes

    NASA Astrophysics Data System (ADS)

    Cocker, R. P.; Challis, R. E.

    1996-06-01

    Tubular array structures are a very common component in industrial heat exchanging plant and the non-destructive testing of these arrays is essential. Acoustic methods using microphones or ultrasound are attractive but require a thorough understanding of the acoustic properties of tube arrays. This paper details the development and testing of a small-scale physical model of a tube array to verify the predictions of a theoretical model for acoustic propagation through tube arrays developed by Heckl, Mulholland, and Huang [1-5] as a basis for the consideration of small-scale physical models in the development of non-destructive testing procedures for tube arrays. Their model predicts transmission spectra for plane waves incident on an array of tubes arranged in straight rows. Relative transmission is frequency dependent with bands of high and low attenuation caused by resonances within individual tubes and between tubes in the array. As the number of rows in the array increases the relative transmission spectrum becomes more complex, with increasingly well-defined bands of high and low attenuation. Diffraction of acoustic waves with wavelengths less than the tube spacing is predicted and appears as step reductions in the transmission spectrum at frequencies corresponding to integer multiples of the tube spacing. Experiments with the physical model confirm the principal features of the theoretical treatment.

  2. Hardware multiplier processor

    DOEpatents

    Pierce, Paul E. (Albuquerque, NM)

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

  3. 3D optical interconnect mesh network for on-board parallel multiprocessor system based on EOPCB

    NASA Astrophysics Data System (ADS)

    Luo, Fengguang; Cao, Mingcui; Zhou, Xinjun; Xu, Jun; Luo, Zhixiang; Yuan, Jing; Zong, Liangjia; Feng, Yonghua; Chen, Chao; Zhang, Conghui

    2007-11-01

    A three-dimensional (3-D) 4×4×4 optical interconnect Mesh network scheme for a parallel multiprocessor system based on a polymer light waveguide electro-optical printed circuit board (EOPCB) is proposed in this paper. The Mesh topological structures of light waveguide interconnects for processor element chip-to-chip on a board, and board-to-board on a backplane, are constructed. The system consists of 64 processor element chips interconnected in a 3-D Mesh network configuration. Every processor board comprises 4×4 processor element chips with Mesh interconnection. Board-to-board Mesh interconnects are established on a backplane through a light waveguide Mesh interconnect topological structure. An additional optical layer with a light waveguide structure is added to a conventional PCB to construct the EOPCB. A vertical cavity surface emitting laser (VCSEL) array is used as the optical transmitter array. A PIN photodiode array is used as the optical receiver array. An MT-compatible direct coupling method is presented to couple the light beam between the optical transmitter/receiver and the light waveguide layer. The optical signals from a processor element chip on one board can be transmitted to a processor element chip on another board through the light waveguide interconnection in the backplane. A 3-D optical interconnection Mesh network for a parallel multiprocessor system can thus be realized by EOPCB.
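
    As a rough illustration of the topology described in this abstract, the Python sketch below enumerates the neighbors of a processor element in a 4×4×4 mesh without wraparound links. The coordinate convention (x and y locate a chip on a board, z selects the board on the backplane) and the function name are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import product

SIZE = 4  # assumed: 4 x 4 chips per board, 4 boards on the backplane

# The six axis-aligned directions of a 3-D mesh.
DIRECTIONS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def mesh_neighbors(x, y, z, size=SIZE):
    """Return the mesh neighbors of node (x, y, z); edges do not wrap around."""
    neighbors = []
    for dx, dy, dz in DIRECTIONS:
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < size and 0 <= ny < size and 0 <= nz < size:
            neighbors.append((nx, ny, nz))
    return neighbors


nodes = list(product(range(SIZE), repeat=3))
# Each undirected link is counted once from each endpoint, so halve the total.
link_count = sum(len(mesh_neighbors(*node)) for node in nodes) // 2
```

    Under these assumptions, links that change only x or y stay on one board, while links that change z cross the backplane.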

  4. Biological Information Signal Processor

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Peterson, John C.; Yoo, Michael M.

    1993-01-01

    Biological Information Signal Processor (BISP) is computing system analyzing data on deoxyribonucleic acid (DNA) sequences for molecular genetic analysis. Includes coprocessors, specialized microprocessors complementing present and future computers by performing rapidly most-time-consuming DNA-sequence-analyzing functions, establishing relationships (alignments) between both global sequences and defining patterns in multiple sequences. Also includes state-of-art software and data-base systems on both conventional and parallel computer systems to augment analytical abilities of developmental coprocessors.

  5. Tiled Multicore Processors

    NASA Astrophysics Data System (ADS)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  6. Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding

    E-print Network

    Kasahara, Hironori

    . For example, Polaris[1] compiler exploits loop level parallelism by using symbolic analysis, runtime data the performance. Loop restructurings such as loop permutation, loop fusion and tiling to change data access and maintain loop parallelism has been used to enhance data locality[8]. Furthermore, after loop fusion

  7. Achromatic wave plate in THz frequency region based on parallel metal plate waveguides with a pillar array.

    PubMed

    Nagai, Masaya; Mukai, Noriyuki; Minowa, Yosuke; Ashida, Masaaki; Suzuki, Takehito; Takayanagi, Jun; Ohtake, Hideyuki

    2015-02-23

    We demonstrated an achromatic wave plate based on parallel metal plate waveguides in the high THz frequency region. The metal plates have periodic rough structures on the surface, which allow slow transverse magnetic wave propagation and fast transverse electric wave propagation. A numerical simulation showed that the height of the periodic roughness is important for optimizing the birefringence. We fabricated stacked metal plates containing two types of structures by chemical etching. An array of small pillars on the metal plates allows higher frequency optimization. We experimentally demonstrated an achromatic quarter-wave plate in the frequency region from 2.0 to 3.1 THz. PMID:25836501

  8. Parallel nanomanufacturing via electrohydrodynamic jetting from microfabricated externally-fed emitter arrays

    NASA Astrophysics Data System (ADS)

    Ponce de Leon, Philip J.; Hill, Frances A.; Heubel, Eric V.; Velásquez-García, Luis F.

    2015-06-01

    We report the design, fabrication, and characterization of planar arrays of externally-fed silicon electrospinning emitters for high-throughput generation of polymer nanofibers. Arrays with as many as 225 emitters and with emitter density as large as 100 emitters cm⁻² were characterized using a solution of dissolved PEO in water and ethanol. Devices with emitter density as high as 25 emitters cm⁻² deposit uniform imprints comprising fibers with diameters on the order of a few hundred nanometers. Mass flux rates as high as 417 g hr⁻¹ m⁻² were measured, i.e., four times the reported production rate of the leading commercial free-surface electrospinning sources. Throughput increases with increasing array size at constant emitter density, suggesting the design can be scaled up with no loss of productivity. Devices with emitter density equal to 100 emitters cm⁻² fail to generate fibers but uniformly generate electrosprayed droplets. For the arrays tested, the largest measured mass flux resulted from arrays with larger emitter separation operating at larger bias voltages, indicating the strong influence of electrical field enhancement on the performance of the devices. Incorporation of a ground electrode surrounding the array tips helps equalize the emitter field enhancement across the array as well as control the spread of the imprints over larger distances.

  9. A novel polymeric microelectrode array for highly parallel, long-term neuronal culture and stimulation

    E-print Network

    Talei Franzesi, Giovanni

    2008-01-01

    Cell-based high-throughput screening is emerging as a disruptive technology in drug discovery; however, massively parallel electrical assaying of neurons and cardiomyocytes has until now been prohibitively expensive. To ...

  10. Dynamically reconfigurable optical morphological processor and its applications

    NASA Technical Reports Server (NTRS)

    Chao, Tien-Hsin

    1993-01-01

    An innovative optically implemented morphological processor is introduced. With the use of a large space-bandwidth-product Dammann grating and a high-speed shutter spatial light modulator, an effective structuring element of large size and arbitrary shape can be constructed with dynamic reconfigurability. This reconfigurability is a major improvement over the conventional correlator-based morphological processor, in which fixed holographic filters are used as structuring elements (Casasent and Botha, 1988). A novel two-dimensional thresholding photodetector array, capable of performing parallel thresholding and feedback, is utilized in this system and makes possible the implementation of many complex morphological operations requiring iterative feedback and full programmability. The optical architecture and the principle of operation are presented. Binary image morphological erosion, dilation, opening, and closing are demonstrated experimentally. A method for extending the approach to gray-scale images using threshold decomposition is also discussed.
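
    The four binary operations named in this abstract can be stated compactly in software. The Python sketch below is a plain digital illustration, not a model of the optical system: images are lists of 0/1 rows, pixels outside the image are treated as 0, and the structuring element is given as a list of (row, column) offsets. The cross-shaped element and all function names are assumptions chosen for the example, and the dilation as written presumes a symmetric structuring element.

```python
# Assumed example structuring element: a 3 x 3 cross, as (row, col) offsets.
CROSS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def _hit(image, r, c):
    """True if (r, c) lies inside the image and the pixel is set."""
    return 0 <= r < len(image) and 0 <= c < len(image[0]) and image[r][c] == 1


def dilate(image, offsets):
    """Binary dilation: a pixel is set if ANY offset lands on a set pixel."""
    return [[int(any(_hit(image, r + dr, c + dc) for dr, dc in offsets))
             for c in range(len(image[0]))] for r in range(len(image))]


def erode(image, offsets):
    """Binary erosion: a pixel is set only if EVERY offset lands on a set pixel."""
    return [[int(all(_hit(image, r + dr, c + dc) for dr, dc in offsets))
             for c in range(len(image[0]))] for r in range(len(image))]


def opening(image, offsets):
    """Erosion followed by dilation; removes features smaller than the element."""
    return dilate(erode(image, offsets), offsets)


def closing(image, offsets):
    """Dilation followed by erosion; fills gaps smaller than the element."""
    return erode(dilate(image, offsets), offsets)
```

    Opening and closing are built from the two primitives, which is why a processor that supports erosion, dilation, and feedback can realize both.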

  11. Supercomputing on massively parallel bit-serial architectures

    NASA Technical Reports Server (NTRS)

    Iobst, Ken

    1985-01-01

    Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

  12. Development and characterization of hollow microprobe array as a potential tool for versatile and massively parallel manipulation of single cells.

    PubMed

    Nagai, Moeto; Oohara, Kiyotaka; Kato, Keita; Kawashima, Takahiro; Shibata, Takayuki

    2015-04-01

    Parallel manipulation of single cells is important for reconstructing in vivo cellular microenvironments and studying cell functions. To manipulate single cells and reconstruct their environments, development of a versatile manipulation tool is necessary. In this study, we developed an array of hollow probes using microelectromechanical systems fabrication technology and demonstrated the manipulation of single cells. We conducted a cell aspiration experiment with a glass pipette and modeled a cell using a standard linear solid model, which provided information for designing hollow stepped probes for minimally invasive single-cell manipulation. We etched a silicon wafer on both sides and formed through holes with stepped structures. The inner diameters of the holes were reduced by SiO2 deposition of plasma-enhanced chemical vapor deposition to trap cells on the tips. This fabrication process makes it possible to control the wall thickness, inner diameter, and outer diameter of the probes. With the fabricated probes, single cells were manipulated and placed in microwells at a single-cell level in a parallel manner. We studied the capture, release, and survival rates of cells at different suction and release pressures and found that the cell trapping rate was directly proportional to the suction pressure, whereas the release rate and viability decreased with increasing the suction pressure. The proposed manipulation system makes it possible to place cells in a well array and observe the adherence, spreading, culture, and death of the cells. This system has potential as a tool for massively parallel manipulation and for three-dimensional hetero cellular assays. PMID:25749639

  13. Analog Processor To Solve Optimization Problems

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

    1993-01-01

    Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

  14. Parallel acquisition of Raman spectra from a 2D multifocal array using a modulated multifocal detection scheme

    NASA Astrophysics Data System (ADS)

    Kong, Lingbo; Chan, James W.

    2015-03-01

    A major limitation of spontaneous Raman scattering is its intrinsically weak signals, which makes Raman analysis or imaging of biological specimens slow and impractical for many applications. To address this, we report the development of a novel modulated multifocal detection scheme for simultaneous acquisition of full Raman spectra from a 2-D m × n multifocal array. A spatial light modulator (SLM), or a pair of galvo-mirrors, is used to generate m × n laser foci. Raman signals generated within each focus are projected simultaneously into a spectrometer and detected by a CCD camera. The system can resolve the Raman spectra with no crosstalk along the vertical pixels of the CCD camera, e.g., along the entrance slit of the spectrometer. However, there is significant overlap of the spectra in the horizontal pixel direction, e.g., along the dispersion direction. By modulating the excitation multifocal array (illumination modulation) or the emitted Raman signal array (detection modulation), the superimposed Raman spectra of different multifocal patterns are collected. The individual Raman spectrum from each focus is then retrieved from the superimposed spectra using a postacquisition data processing algorithm. This development leads to a significant improvement in the speed of acquiring Raman spectra. We discuss the application of this detection scheme for parallel analysis of individual cells with multifocus laser tweezers Raman spectroscopy (M-LTRS) and for rapid confocal hyperspectral Raman imaging.
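
    The retrieval step can be pictured as solving a small linear system. The Python sketch below is a toy model under stated assumptions, not the authors' algorithm: it supposes that each acquisition records a known weighted sum of the per-focus spectra (the modulation pattern) and it ignores any spectral shift between foci along the dispersion direction. The pattern matrix, the spectra, and the function name are all invented for illustration.

```python
def solve_patterns(patterns, measured):
    """Solve patterns @ spectra = measured for spectra by Gauss-Jordan elimination.

    patterns: n x n list of modulation weights (one row per acquisition).
    measured: n superimposed spectra, each a list of intensities.
    """
    n = len(patterns)
    a = [[float(v) for v in row] for row in patterns]
    b = [[float(v) for v in row] for row in measured]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("modulation patterns are not linearly independent")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        scale = a[col][col]
        a[col] = [v / scale for v in a[col]]
        b[col] = [v / scale for v in b[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
                b[r] = [v - factor * p for v, p in zip(b[r], b[col])]
    return b


# Hypothetical data: three foci whose spectra overlap on the detector.
true_spectra = [[1.0, 5.0, 2.0], [0.5, 0.5, 4.0], [3.0, 1.0, 1.0]]
patterns = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]  # which foci are on per acquisition
measured = [[sum(w * s[k] for w, s in zip(row, true_spectra)) for k in range(3)]
            for row in patterns]
recovered = solve_patterns(patterns, measured)
```

    In this toy model the number of acquisitions equals the number of overlapping foci, and the patterns must be linearly independent for the spectra to be separable.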

  15. Communication efficient parallel algorithms for nonnumerical computations

    SciTech Connect

    Doshi, K.A.

    1988-01-01

    The broad goal of this research is to develop a set of paradigms for mapping data-dependent symbolic computations on realistic models of parallel architectures. Within this goal, the thesis represents the initial effort to achieve efficient parallel solutions for a number of non-numerical problems on networks of processors. The specific contributions of the thesis are new parallel algorithms, exhibiting linear speedup on architectures consisting of fixed numbers of processors (i.e., bounded models). The following problems have been considered in the thesis: (1) Determine the minimum spanning tree (MST), and identify the bridges and articulation points (APs) of an undirected weighted graph represented by an n x n adjacency matrix. (2) The pattern matching problem: Given two strings of characters, of lengths m and n ({number sign}m) respectively, mark all positions in the second string where there appears an instance of the first string. (3) Sort n elements. For each problem, the author uses a processor-network consisting of p processors. The network model used in the solution of the first set of problems is the linear array; while that used in the solutions of the second and third problems is a butterfly-connected system. The solutions on the butterfly-connected system apply also on a pipelined hypercube. The performances of the solutions are summarized.
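
    For reference, the pattern matching problem stated as item (2) can be written as a short sequential routine. The Python sketch below is only a specification-level illustration with an assumed function name; it is not the thesis's butterfly-network algorithm. It does show why the problem parallelizes naturally: every start position is tested independently of the others.

```python
def mark_matches(pattern, text):
    """Return a list of booleans, True where an instance of pattern starts in text."""
    m, n = len(pattern), len(text)
    marks = [False] * n
    if m == 0 or m > n:
        return marks
    # Each start position is checked independently, so the positions could be
    # divided among p processors with no dependence between the checks.
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            marks[start] = True
    return marks
```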

  16. Accuracy Limitations in Optical Linear Algebra Processors

    Microsoft Academic Search

    Stephen Gordon Batsell

    1990-01-01

    One of the limiting factors in applying optical linear algebra processors (OLAPs) to real-world problems has been the poor achievable accuracy of these processors. Little previous research has been done on determining noise sources from a systems perspective which would include noise generated in the multiplication and addition operations, noise from spatial variations across arrays, and from crosstalk. In this

  17. EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR

    E-print Network

    Chu, Pong P.

    a processor, memory modules, I/O periph- erals, and custom hardware accelerators into a single integrated circuit. As the capacity of FPGA (field-programmable gate array) devices continues to grow, the same-sized processor and off-the-shelf peripherals and the software is customized to implement the desired

  18. High speed IP lookup algorithm with scalability and parallelism based on CAM array and TCAM

    Microsoft Academic Search

    Tan Mingfeng; Gong Zhenghu

    2004-01-01

    With the rapid increase in Internet bandwidth, higher performance routers are needed, and their speed depends heavily on the IP routing lookup process. This paper proposes a high performance IP routing lookup algorithm based on a CAM array and TCAM. It searches faster while using less memory. For real route tables with 128K prefixes this scheme needs only 17 CAMs with
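
    The operation that such lookup hardware accelerates is longest-prefix matching. The Python sketch below shows that operation in software using the standard ipaddress module; it is a linear scan written for clarity and does not represent the CAM/TCAM scheme of the paper. The routing table contents and the function name are made up for the example.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the next hop of the most specific prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    best_net, best_hop = None, None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        # Among all matching prefixes, keep the one with the longest mask.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, next_hop
    return best_hop


# Hypothetical routing table: prefix -> next hop label.
ROUTES = {
    "0.0.0.0/0": "default",
    "10.0.0.0/8": "A",
    "10.1.0.0/16": "B",
}
```

    A TCAM performs the comparison against all stored prefixes at once, whereas this sketch visits them one by one.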

  19. Eight-Channel Head Array and Control System for Parallel Transmit/Receive Magnetic Resonance Imaging

    E-print Network

    Moody, Katherine

    2014-08-11

    system was programmed in LabVIEW using off-the-shelf hardware to manage pulse playback, correct transmit chain non-linearities, monitor on-coil waveforms, and drive the transmit hardware. The transmit array was shown with well-isolated patterns...

  20. Parallel operation of a 32 x 16 symmetric self-electrooptic effect device array

    NASA Astrophysics Data System (ADS)

    McCormick, F. B.; Lentine, A. L.; Morrison, R. L.; Walker, S. L.; Chirovsky, L. M. F.

    1991-03-01

    Free-space optics for digital optical computing or for electrooptic interconnections is considered. Results of an experiment are presented in which 512 symmetric self-electrooptic effect devices (S-SEEDs) were simultaneously operated. In this experiment, simultaneous continuous bistable operation was shown for the 32 x 16 array of S-SEEDs, as well as simultaneous optical latching of optical data in a random access manner onto the array. Each S-SEED received two inputs and reflected two outputs; thus, the total number of pinouts demonstrated was 2048. Each S-SEED consists of two multiple-quantum-well modulators (the S and R modulators) connected in series. Two AlGaAs laser diodes with output powers of 4 mW were used. One laser was used to drive all of the S modulators, and the other laser was used to drive the R modulators. After collimation, the outputs of the lasers were combined on a knife-edge mirror using the optical isolator arrangement. This pair of beams was replicated 512 times (32 x 16) by a two-dimensional binary phase grating (BPG), forming an array of beamlets. Some optical issues in the operation of large optical switching device arrays are discussed.

  1. AIFSP: An Adaptive Instruction Flow Stream Processor

    Microsoft Academic Search

    Yaohua Wang; Shuming Chen; Jianghua Wan; Kai Zhang; Shenggang Chen

    2011-01-01

    Stream processors are efficient for media applications because they exploit the features of media processing, such as data parallelism and producer-consumer locality. However, the loosely coupled structure between the host and the stream processor makes communication between the scalar and SIMD parts costly and scheduling across kernels less flexible. In addition, the kernel loading time adds further cost. When the stream

  2. Hierarchical Checking of Multiprocessors Using Watchdog Processors

    E-print Network

    Fey, Dietmar

    program; thus, in the watchdog processor neither a reference database nor a time-consuming search and compare engine is required. 1 Introduction Massively parallel computing systems running computing or a special WP pro- gram of signature evaluation instructions.(In [6] the main processor itself emulates

  3. Data parallel algorithms

    Microsoft Academic Search

    W. Daniel Hillis; Guy L. Steele Jr.

    1986-01-01

    Parallel computers with tens of thousands of processors are typically programmed in a data parallel style, as opposed to the control parallel style used in multiprocessing. The success of data parallel algorithms—even on problems that at first glance seem inherently serial—suggests that this style of programming has much wider applicability than was previously thought.
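
    The abstract does not name a specific algorithm, so the following is only an illustrative sketch of the data parallel style: an inclusive prefix sum, a problem that looks serial but finishes in about log2(n) synchronous steps when every element is updated at once. The function name and the list-based simulation of "one processor per element" are assumptions made for this example.

```python
def data_parallel_prefix_sum(values):
    """Inclusive prefix sum computed in ceil(log2(n)) synchronous steps."""
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # Each conceptual processor i reads only from the previous array,
        # so all n updates in one step are logically simultaneous.
        x = [x[i] + x[i - step] if i >= step else x[i] for i in range(n)]
        step *= 2
    return x


if __name__ == "__main__":
    print(data_parallel_prefix_sum([1, 2, 3, 4, 5]))
```

    On a real SIMD machine the list comprehension would be a single parallel instruction per step; here it is simulated sequentially.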

  4. Introduction to Parallel Programming

    E-print Network

    Introduction to Parallel Programming. Tuesday, April 17, 12. Overview · Parallel programming allows the user to use multiple CPUs concurrently · Reasons for parallel execution: · shorten execution expect as a function of the number of processors (N) used and the code fraction that is parallel (p). T(1
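
    The snippet is cut off before its formula, so the following is a guess at what it refers to rather than a reconstruction: a speedup that depends on the number of processors N and the parallel code fraction p is usually modeled with Amdahl's law. The function name and argument checks are assumptions for illustration.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup for parallel fraction p (0..1) on n processors."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    serial = 1.0 - p  # fraction of the run that cannot be parallelized
    return 1.0 / (serial + p / n)


if __name__ == "__main__":
    for n in (1, 2, 8, 64):
        print(n, round(amdahl_speedup(0.9, n), 3))
```

    Under this model the speedup is bounded by 1 / (1 - p) no matter how many processors are added.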

  5. Interactive animation of fault-tolerant parallel algorithms

    SciTech Connect

    Apgar, S.W.

    1992-02-01

    Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.

  6. Pin-Hole Array Correlation Imaging: Highly Parallel Fluorescence Correlation Spectroscopy

    PubMed Central

    Needleman, Daniel J.; Xu, Yangqing; Mitchison, Timothy J.

    2009-01-01

    In this work, we describe pin-hole array correlation imaging, a multipoint version of fluorescence correlation spectroscopy, based upon a stationary Nipkow disk and a high-speed electron multiplying charged coupled detector. We characterize the system and test its performance on a variety of samples, including 40 nm colloids, a fluorescent protein complex, a membrane dye, and a fluorescence fusion protein. Our results demonstrate that pin-hole array correlation imaging is capable of simultaneously performing tens or hundreds of fluorescence correlation spectroscopy-style measurements in cells, with sufficient sensitivity and temporal resolution to study the behaviors of membrane-bound and soluble molecules labeled with conventional chemical dyes or fluorescent proteins. PMID:19527665

  7. Adaptive antenna processor test results

    Microsoft Academic Search

    J. R. Hamalainen; J. W. Howland

    1991-01-01

    Summary form only given. A two-element adaptive array processor (AAP) was designed, built and tested. The primary purpose of this development was to demonstrate the operational airborne utility and to quantify the amount of jam resistance improvement offered by the AAP when operating in conjunction with a tactical anti-jam (AJ) radio set employing fast frequency hopping. Performance was measured with

  8. Large Area Parallel Surface Nanostructuring with Laser Irradiation Through Microlens Arrays

    Microsoft Academic Search

    C. S. Lim; M. H. Hong; Y. Lin; L. S. Tan; A. Senthil Kumar; M. Rahman

    2010-01-01

    In the past decade, the development of nanoelectronics and nano-optics has attracted much interest in surface nanostructuring of semiconductor materials. The irradiation of a microlens array by a laser beam generates many focused light spots, which can act as a direct writing tool on photo-polymer materials. This maskless surface nanostructuring technique enables thousands to millions of identical nano-features to be

  9. LARGE AREA PARALLEL SURFACE NANOSTRUCTURING WITH LASER IRRADIATION THROUGH MICROLENS ARRAYS

    Microsoft Academic Search

    C. S. LIM; M. H. HONG; Y. LIN; L. S. TAN; A. SENTHIL KUMAR; M. RAHMAN

    2010-01-01

    In the past decade, the development of nanoelectronics and nano-optics has attracted much interest in surface nanostructuring of semiconductor materials. The irradiation of a microlens array by a laser beam generates many focused light spots, which can act as a direct writing tool on photo-polymer materials. This maskless surface nanostructuring technique enables thousands to millions of identical nano-features to be

  10. High performance selectively oxidized VCSELs and arrays for parallel high-speed optical interconnects

    Microsoft Academic Search

    F. Mederer; M. Grabherr; F. Eberhard; I. Ecker; R. Jager; J. Joos; C. Jung; M. Kicherer; R. King; P. Schnitzer; H. Unold; D. Wiedenmann; K. J. Ebeling

    2000-01-01

    We introduce a new layout for high-bandwidth single-mode selectively oxidized vertical-cavity surface-emitting laser (VCSEL) arrays operating at 980 nm or 850 nm emission wavelength for substrate or epitaxial side emission. Coplanar feeding lines and polyimide passivation are used to reduce electrical parasitics in top-emitting GaAs and bottom-emitting InGaAs VCSELs. In order to enhance fundamental single-mode emission for larger devices of

  11. Database Reorganization in Parallel Disk Arrays with I/O Service Stealing

    NASA Technical Reports Server (NTRS)

    Zabback, Peter; Onyuksel, Ibrahim; Scheuermann, Peter; Weikum, Gerhard

    1996-01-01

    We present a model for data reorganization in parallel disk systems that is geared towards load balancing in an environment with periodic access patterns. Data reorganization is performed by disk cooling, i.e. migrating files or extents from the hottest disks to the coldest ones. We develop an approximate queueing model for determining the effective arrival rates of cooling requests and discuss its use in assessing the costs versus benefits of cooling.
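
    As a rough illustration of the disk cooling idea described above (migrating data from the hottest disk to the coldest), here is a minimal greedy sketch. The data layout, the function name `cool_once`, and the rule for picking which file to move are all assumptions made for this example; the abstract does not specify them, and the paper's queueing model is not reproduced here.

```python
def cool_once(disks):
    """Move one file from the hottest disk to the coldest disk if that helps.

    disks: dict mapping disk name -> dict of file name -> heat (access rate).
    Returns (file, source, target), or None when no move reduces the imbalance.
    """
    load = {d: sum(files.values()) for d, files in disks.items()}
    hot = max(load, key=load.get)
    cold = min(load, key=load.get)
    gap = load[hot] - load[cold]
    # A file of heat h is worth moving only if 0 < h < gap; otherwise the
    # target would end up at least as hot as the source was before the move.
    candidates = [(h, f) for f, h in disks[hot].items() if 0 < h < gap]
    if not candidates:
        return None
    _, name = max(candidates)  # hottest file that still fits in the gap
    disks[cold][name] = disks[hot].pop(name)
    return name, hot, cold


if __name__ == "__main__":
    disks = {"d0": {"a": 5, "b": 3, "c": 2}, "d1": {"x": 1}, "d2": {"y": 4}}
    while (move := cool_once(disks)) is not None:
        print("moved", move)
    print({d: sum(f.values()) for d, f in disks.items()})
```

    A real system would also weigh the I/O cost of each migration against its benefit, which is the trade-off the abstract's queueing model is meant to assess.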

  12. Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems

    PubMed Central

    Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

    2011-01-01

    A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on the single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Different from existing algorithms, the updating rates of the coning and sculling compensations are unrelated with the number of the gyro incremental angle samples and the number of the accelerometer incremental velocity samples. When the output sampling rate of inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, yet with more numbers of gyro incremental angle and accelerometer incremental velocity in order to improve the accuracy of system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of algorithm to meet the real-time and high precision requirements of system on the high dynamic environment, relative to the existing implemented on the DSP platform. PMID:22164058

  13. Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis

    NASA Astrophysics Data System (ADS)

    Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

    2013-11-01

    Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atomic transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system was achieved in reduction of 4-nitrophenol.

  14. Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis.

    PubMed

    Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

    2013-12-01

    Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atomic transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system was achieved in reduction of 4-nitrophenol. PMID:24129356

  15. Parallel fabrication of electrode arrays on single-walled carbon nanotubes using dip-pen-nanolithography-patterned etch masks.

    PubMed

    Park, Steve; Wang, Wechung Maria; Bao, Zhenan

    2010-05-01

    This article presents a novel application of using dip-pen nanolithography (DPN) to fabricate Au electrodes concurrently in a high-throughput fashion through an etch resist. We have fabricated 26 pairs of electrodes, where cleanly etched electrode architectures, along with a high degree of feature-size controllability and tip-to-tip uniformity, were observed. Moreover, electrode gaps in the sub-100-nm regime have been successfully fabricated. Conductivity measurements of multiple electrodes in the array were all comparable to that of bulk Au, confirming the reliability and the low-resistance property of the electrodes. Finally, as a demonstration of electrode functionality, SWNT devices were fabricated and the electrical properties of an SWNT device were measured. Hence, our experimental results validate DPN as an effective tool in generating high-quality electrodes in a parallel manner with mild, simple processing steps at a relatively low cost. PMID:20163131

  16. Customization of application specific heterogeneous multi-pipeline processors

    Microsoft Academic Search

    Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

    2006-01-01

    In this paper we propose application specific instruction set processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and

  17. Load follow simulation of three-dimensional boiling water reactor core by PACS32 parallel microprocessor system

    Microsoft Academic Search

    T. Hoshino; T. Shirakawa

    1982-01-01

    The three-dimensional boiling water reactor (BWR) core following the daily load was simulated by the use of the processor array for continuum simulation (PACS-32), a newly developed parallel microprocessor system. The PACS system consists of 32 processing units (PUs) (microprocessors) and has a multiinstruction, multidata type architecture, being optimum to the numerical simulation of the partial differential equations. The BWR

  18. Magnetic arrays

    DOEpatents

    Trumper, D.L.; Kim, W.; Williams, M.E.

    1997-05-20

    Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

  19. Magnetic arrays

    DOEpatents

    Trumper, David L. (Plaistow, NH); Kim, Won-jong (Cambridge, MA); Williams, Mark E. (Pelham, NH)

    1997-05-20

    Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

  20. FFT Computation with Systolic Arrays, A New Architecture

    NASA Technical Reports Server (NTRS)

    Boriakoff, Valentin

    1994-01-01

    The use of the Cooley-Tukey algorithm for computing the 1-D FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.

  1. Orthogonal and parallel lattice plasmon resonance in core-shell SiO(2)/Au nanocylinder arrays.

    PubMed

    Lin, Linhan; Yi, Yasha

    2015-01-12

    Height induced coupling behavior between the plasmonic modes and diffraction orders were studied in the core-shell SiO(2)/Au nanocylinder arrays (NCAs) using finite difference time domain (FDTD) simulations. New lattice plasmon modes (LPMs) are observed in the structures with high aspect ratio. Specifically, parallel coupling between the plasmonic modes and diffraction orders is obtained here, which shows different coupling behavior from orthogonal LPMs. Electromagnetic (EM) field distributions indicate that horizontal propagation of the magnetic or electric field component is responsible for the generation of these orthogonal and parallel LPMs, respectively. Radiative loss could be effectively suppressed when the height increases. This is important for the applications of fluorescence enhancement and nano laser. Further studies confirm that the LPMs associated with the superstrate diffraction orders could be well maintained even when the Au coating is imperfect. The interference from the substrate associated LPMs could be eliminated by cutting off the corresponding diffraction waves by inducing a Si(3)N(4) substrate. This study of coupling behavior in the core-shell NCAs enables a novel route to design and optimize the LPMs for applications of bio-sensing and nano laser. PMID:25835660

  2. Parallel multi-step nanolithography by nanoscale Cu-covered h-PDMS tip array

    NASA Astrophysics Data System (ADS)

    Chang, Yuan-Jen; Huang, Han-Kuan

    2014-09-01

    Tip-based nanolithography provides a flexible nanolithographic technology. Tip fabrication is one of the main challenges. In this paper, we propose to combine the dry etching of photoresist and electro-chemical machining to reduce the size of the tip opening. We successfully fabricate a tip opening with a diameter of 200 nm. After lithography and lift-off, gold dot patterns with a diameter of 280 nm are demonstrated. Moreover, a home-made multi-step exposure system is built and both the successful 14- and 44-step nanolithography by a tip array are also demonstrated in the paper.

  3. A General-Purpose CMOS Vision Chip with a Processor-Per-Pixel SIMD Array Piotr Dudek and Peter J. Hicks

    E-print Network

    Dudek, Piotr

    dissipation. The prototype 21×21 SCAMP vision chip is fabricated in a 0.6µm CMOS technology and achieves, using a smart-sensor device. Some simple low-level image processing tasks can be implemented using few bits of memory per processor) they can be hardly considered "general-purpose". There have been

  4. Scalability and Communication in Parallel Low-Complexity Lossless Compression

    Microsoft Academic Search

    Luigi Cinque; Sergio De Agostino; Luca Lombardi

    2010-01-01

    Approximation schemes for optimal compression with static and sliding dictionaries which can run on a simple array of processors with distributed memory and no interconnections are presented. These approximation algorithms can be implemented on both small and large scale parallel systems. The sliding dictionary method requires large size files on large scale systems. As far as lossless image compression is

  5. Eigenvalue computation of large symmetric tridiagonal matrices on concurrent processors

    NASA Technical Reports Server (NTRS)

    Chang, H. Y.; Utku, S.; Salama, M.

    1988-01-01

    Symmetric tridiagonal eigenvalue problems may arise indirectly in structural dynamic analysis. An algorithm for eigenvalue computation of large symmetric tridiagonal matrices on concurrent processors to meet the challenge of the new emerging computer hardware technology is presented. A standard bisection method in conjunction with Sylvester's Theorem is chosen to be converted into a parallel N-section algorithm. This parallel algorithm takes advantage of the multi-processor environment by carrying out N (number of processors) triangular factorizations of chosen shifted matrices in all processors concurrently and by minimizing communication between processors. The algorithm is designed for local-memory concurrent processors, i.e. message passing type processors. The efficiency and speed-up are given in terms of problem and machine parameters. The algorithm is very efficient when both the number of processors and the number of eigenvalues to be extracted are much smaller than the order of the tridiagonal matrix.
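
    The N-section idea in this abstract can be sketched as follows. This is a sequential simulation under stated assumptions, not the authors' implementation: `count_below` counts the negative pivots of the triangular factorization of the shifted matrix (which by Sylvester's law of inertia equals the number of eigenvalues below the shift), and `nsection_eigenvalue` evaluates that count at several shifts per iteration, one per conceptual processor. Both function names, the zero-pivot guard, and the tolerance are hypothetical choices.

```python
def count_below(diag, off, x):
    """Count eigenvalues < x of a symmetric tridiagonal matrix.

    diag: main diagonal (length n); off: off-diagonal (length n - 1).
    Counts negative pivots in the LDL^T factorization of (T - x*I).
    """
    count = 0
    d = 1.0
    for i, a in enumerate(diag):
        b2 = off[i - 1] ** 2 if i > 0 else 0.0
        d = (a - x) - b2 / d
        if d == 0.0:
            d = 1e-300  # nudge an exact zero pivot to avoid division by zero
        if d < 0:
            count += 1
    return count


def nsection_eigenvalue(diag, off, k, lo, hi, n_procs=4, tol=1e-10):
    """Find the k-th smallest eigenvalue (k is 0-based) inside [lo, hi]."""
    while hi - lo > tol:
        h = (hi - lo) / (n_procs + 1)
        shifts = [lo + (j + 1) * h for j in range(n_procs)]
        # These factorizations are independent: one per processor.
        counts = [count_below(diag, off, s) for s in shifts]
        new_lo, new_hi = lo, hi
        for s, c in zip(shifts, counts):
            if c <= k:
                new_lo = s  # at most k eigenvalues lie below s
            else:
                new_hi = s  # more than k eigenvalues lie below s
                break
        lo, hi = new_lo, new_hi
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # tridiag(-1, 2, -1) of order 3; [0, 4] is a Gershgorin bound.
    print(nsection_eigenvalue([2.0, 2.0, 2.0], [-1.0, -1.0], 0, 0.0, 4.0))
```

    Each iteration shrinks the interval by a factor of n_procs + 1 instead of 2, which is the gain over plain bisection when the shifts are evaluated concurrently.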

  6. Current research in parallel microprocessing systems at Los Alamos

    SciTech Connect

    Ethridge, C.D.

    1984-05-02

    The Computing and Communications Division at the Los Alamos National Laboratory has designed and is building a parallel microprocessor system (PuPS) to serve as a research tool for evaluating parallel processing of large-scale scientific codes. PuPS is an experimental architecture consisting of an orthogonal array of 20 processing elements by 32 memory elements, establishing a tightly coupled, shared-memory (16-Mbyte) machine. The hardware incorporates VLSI components, such as 16-bit microprocessors, floating-point co-processors, and dynamic random access memories. The design replaces conventional MSI/SSI circuitry with programmable array logic, logic sequencers, and logic arrays. This experimental system, which is only 1 element of the parallel processing research being done by the Laboratory's Computing and Communications Division, will enable direct comparisons of speedups of algorithms for a variety of multiprocessor architectures.

  7. IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 41, NO. 3, MARCH 1993 Fig. 5. The linear array for parallel implementations without feedback

    E-print Network

    Botea, Adi

    Hoog, "QR factorization of Toeplitz matrices," Numer. Math., vol. 49, pp. 81-94, 1986. D. R. Sweet. L. Scharf, "Fast algorithms for computing QR and Cholesky factors of Toeplitz operators," IEEE Trans, and F. R. de Hoog. "Linearly connected arrays for Toeplitz least squares problems," J. Parallel Dis

  8. An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

    E-print Network

    Junichiro Makino

    2001-08-27

    We present a novel, highly efficient algorithm to parallelize O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as we increase the number of processors, since the communication cost is independent of the number of processors. In the new algorithm, p processors are organized as a $\sqrt{p}\times\sqrt{p}$ two-dimensional array. Each processor has $N/\sqrt{p}$ particles, but the data are distributed in such a way that complete system is presented if we look at any row or column consisting of $\sqrt{p}$ processors. In this algorithm, the communication cost scales as $N/\sqrt{p}$, while the calculation cost scales as $N^2/p$. Thus, we can use a much larger number of processors without losing efficiency compared to what was practical with previously known algorithms.
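
    The scaling argument in this abstract can be checked with a small cost model. The sketch below uses only the proportionalities stated above (communication independent of p for the replicated scheme, N/sqrt(p) for the two-dimensional array, calculation N^2/p for both); all constant factors are set to 1, which is an assumption, and the function names are hypothetical.

```python
import math


def ratio_replicated(n_particles, p):
    """Communication/calculation ratio when every processor holds all particles."""
    comm = n_particles            # independent of the processor count
    calc = n_particles ** 2 / p
    return comm / calc            # grows like p / N


def ratio_grid(n_particles, p):
    """Same ratio for the sqrt(p) x sqrt(p) processor array."""
    comm = n_particles / math.sqrt(p)
    calc = n_particles ** 2 / p
    return comm / calc            # grows like sqrt(p) / N


if __name__ == "__main__":
    n = 10_000
    for p in (4, 64, 1024):
        print(p, ratio_replicated(n, p), ratio_grid(n, p))
```

    In this model the array layout lowers the ratio by a factor of sqrt(p), which is why it tolerates many more processors before communication dominates.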

  9. [6] H. Meijer and S. G. Akl. Optimal computation of prefix sums on a binary tree of processors. International Journal of Parallel Programming, 16:127--136, 1987.

    E-print Network

    Plaxton, Charles Gregory

    in this paper. Namely, Cypher and Plaxton [4] have developed a hypercube sorting algorithm with running time O International Conference on Parallel Processing, pages 355--362, 1987. [4] R. E. Cypher and C. G. Plaxton

  10. VLSI high speed packet processor

    NASA Technical Reports Server (NTRS)

    Grebowsky, Gerald J.; Dominy, Carol T.

    1988-01-01

    The Goddard Space Flight Center Mission Operations and Data Systems Directorate has developed a packet processor card utilizing semicustom very large scale integration (VLSI) devices, microprocessors, and programmable gate arrays to support the implementation of multichannel telemetry data capture systems. This card will receive synchronized error corrected telemetry transfer frames and output annotated application packets derived from this data. An adaptable format capability is provided by the programmability of three microprocessors while the throughput capability of the packet processor is achieved by a data pipeline consisting of two separate RAM systems controlled by specially designed semicustom VLSI logic.

  11. Machine-Description Driven Compilers for EPIC and VLIW Processors

    Microsoft Academic Search

    B. RAMAKRISHNA RAU; VINOD KATHAIL; SHAIL ADITYA

    1999-01-01

    In the past, due to the restricted gate count available on an inexpensive chip, embedded DSPs have had limited parallelism, few registers and irregular, incomplete interconnectivity. More recently, with increasing levels of integration, embedded VLIW processors have started to appear. Such processors typically have higher levels of instruction-level parallelism, more registers, and a relatively regular interconnect between the registers and

  12. Electrically reconfigurable logic array

    NASA Technical Reports Server (NTRS)

    Agarwal, R. K.

    1982-01-01

    To compose the complicated systems using algorithmically specialized logic circuits or processors, one solution is to perform relational computations such as union, division and intersection directly on hardware. These relations can be pipelined efficiently on a network of processors having an array configuration. These processors can be designed and implemented with a few simple cells. In order to determine the state-of-the-art in Electrically Reconfigurable Logic Array (ERLA), a survey of the available programmable logic array (PLA) and the logic circuit elements used in such arrays was conducted. Based on this survey some recommendations are made for ERLA devices.

  13. Multi-Processor For Vision

    NASA Astrophysics Data System (ADS)

    Fehervari, I.; Lambrechts, P.; Baetens, E.; Oosterlinck, A.

    1987-10-01

    A multiprocessor system for fast Image Processing (IP) is presented with two categories of (high-speed) processors, supervised by a general purpose microprocessor: the AD-p is optimized for pixel address calculations and the DA-p is optimized for integer arithmetic. Due to a high hierarchical structure, problems can be split up into smaller (independent, parallel) subtasks which are easy to implement. The resulting AD-DA configuration can be used as a powerful general purpose IP system. Since both processors are implemented each on a single board, they offer a valuable solution for diverse industrial vision applications.

  14. CFHT's Generation III Controller: a multiamplifier focal plane array readout system

    NASA Astrophysics Data System (ADS)

    Kerr, John M.; Clark, Christopher C.; Smith, S. S.

    1994-06-01

    To manage the increased size of focal plane arrays (FPA) and the accompanying data volume increase, we have developed a detector control system that allows parallel readout of multiple array amplifiers. This system consists of a data acquisition computer running the Pegasus Software System on a UNIX platform, an interface control computer and a digital signal processor (DSP) to directly control the array. At the heart of the system is a DSP-based controller (Leach SDSU) which serves as a waveform generator and video processor. It is completely software programmable, making it quite adaptable to a host of FPA designs. We have configured our system into a four channel system, reading pixel data out of all four quadrants of an array. Image data is descrambled in the interface computer, producing a sensible image on the image display. We have applied this system to control and readout of large format CCDs and to IR arrays.

  15. Upset Characterization of the PowerPC405 Hard-core Processor Embedded in Virtex-II Pro Field Programmable Gate Arrays

    NASA Technical Reports Server (NTRS)

    Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez

    2006-01-01

    Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.

  16. Upset Characterization and Test Methodology of the PowerPC405 Hard-Core Processor Embedded in Xilinx Field Programmable Gate Arrays

    Microsoft Academic Search

    Gregory R. Allen; Gary M. Swift; Greg Miller

    2007-01-01

    Pseudo-static upset results for memory elements in the PPC405 core embedded in a 1.5 V, 130 nm Virtex-II Pro FPGA are compared to the PPC405 core embedded in a 1.2 V, 90 nm Virtex-4 FX FPGA. The results show consistency with earlier PowerPC processor measurements and illuminate scaling trends. While details vary, the upsetable elements consistently yield very low thresholds

  17. Column-selection-enabled 8T SRAM array with ?1R/1W multi-port operation for DVFS-enabled processors

    Microsoft Academic Search

    Sang Phill Park; Soo Youn Kim; Dongsoo Lee; Jae-Joon Kim; W. Paul Griffin; Kaushik Roy

    2011-01-01

    In this work, we propose a new multi-port 8T SRAM architecture suitable for DVFS enabled processors. With multi-way caches using 8T SRAM, write-back operations are required to support column selection. While conventional write-back schemes may not have the 1R/1W dual port advantage of 8T SRAM, our proposed local write-back scheme preserves both ports with only minimal limitations. Simulation results

  18. Standard Templates Adaptive Parallel Library

    E-print Network

    Arzu, Francisco Jose

    2000-01-01

    STAPL (Standard Templates Adaptive Parallel Library) is a parallel C++ library designed as a superset of the C++ Standard Template Library (STL), sequentially consistent for functions with the same name, and executed on uni- or multi-processor...

  19. Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors

    SciTech Connect

    Aghili Yajadda, Mir Massoud [CSIRO Manufacturing Flagship, P.O. Box 218, Lindfield NSW 2070 (Australia)

    2014-10-21

    We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal-size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV–50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to size distribution in the networks and irregular shape of nanoparticles. Non-Arrhenius behavior of the samples at zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

  20. Contextual classification on a CDC Flexible Processor system. [for photomapped remote sensing data

    NASA Technical Reports Server (NTRS)

    Smith, B. W.; Siegel, H. J.; Swain, P. H.

    1981-01-01

    A potential hardware organization for the Flexible Processor Array is presented. An algorithm that implements a contextual classifier for remote sensing data analysis is given, along with uniprocessor classification algorithms. The Flexible Processor algorithm is provided, as are simulated timings for contextual classifiers run on the Flexible Processor Array and another system. The timings are analyzed for context neighborhoods of sizes three and nine.

  1. Pthreads for Dynamic Parallelism

    Microsoft Academic Search

    Girija J. Narlikar; Guy E. Blelloch

    1998-01-01

    Expressing a large number of lightweight, parallel threads in a shared address space significantly eases the task of writing a parallel program. Threads can be dynamically created to execute individual parallel tasks; the implementation schedules these threads onto the processors and effectively balances the load. However, unless the threads scheduler is designed carefully, such a parallel program may suffer

  2. Parallel architectures for iterative methods on adaptive, block structured grids

    NASA Technical Reports Server (NTRS)

    Gannon, D.; Vanrosendale, J.

    1983-01-01

    A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special-purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one-to-one mapping of grids to systolic-style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be a regular global structure to the grids constructed, there will be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.

  3. Algorithmically specialized parallel computers

    SciTech Connect

    Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.

    1985-01-01

    This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.

  4. PROCESSOR SUPPORT: Most ARM-based processors such as

    E-print Network

    Narasayya, Vivek

    Instruments Jacinto processors · Renesas SH4-based processors such as the SH7785 · Intel iA86-based processors's i.MX31 and i.MX35 · Texas Instruments Jacinto processors · Renesas SH4-based processors

  5. Opto-electronic morphological processor

    NASA Technical Reports Server (NTRS)

    Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

    1993-01-01

    The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.

  6. High performance parallel computers for science: New developments at the Fermilab advanced computer program

    SciTech Connect

    Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

    1988-08-01

    Fermilab's Advanced Computer Program (ACP) has been developing highly cost-effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs.

  7. Sandia secure processor : a native Java processor

    Microsoft Academic Search

    Gregory Lloyd Wickstrom; Jason Carl Gale; Kwok Kee Ma

    2003-01-01

    The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP's design is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it

  8. SCAN secure processor and its biometric capabilities

    NASA Astrophysics Data System (ADS)

    Kannavara, Raghudeep; Mertoguno, Sukarno; Bourbakis, Nikolaos

    2011-04-01

    This paper presents the design of the SCAN secure processor and its extended instruction set to enable secure biometric authentication. The SCAN secure processor is a modified SparcV8 processor architecture with a new instruction set to handle voice, iris, and fingerprint-based biometric authentication. The algorithms for processing biometric data are based on the local global graph methodology. The biometric modules are synthesized in reconfigurable logic and the results of the field-programmable gate array (FPGA) synthesis are presented. We propose to implement the above-mentioned modules in an off-chip FPGA co-processor. Further, the SCAN-secure processor will offer SCAN-based encryption and decryption of 32-bit instructions and data.

  9. Reconfigurable VLSI architecture for a database processor

    SciTech Connect

    Oflazer, K.

    1983-01-01

    This work brings together the processing potential offered by regularly structured VLSI processing units and the architecture of a database processor, the relational associative processor (RAP). The main motivations are to integrate a RAP cell processor on a few VLSI chips and improve performance by employing procedures exploiting these VLSI chips and the system level reconfigurability of processing resources. The resulting VLSI database processor consists of parallel processing cells that can be reconfigured into a large processor to execute the hard operations of projection and semijoin efficiently. It is shown that such a configuration can provide 2 to 3 orders of magnitude of performance improvement over previous implementations of the RAP system in the execution of such operations. 27 refs.

  10. Speculative parallelization of partially parallel loops

    E-print Network

    Dang, Francis Hoai Dinh

    2009-05-15

    , and applied a fully parallel data dependence test to determine if it had any cross-processor dependences. If the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops...

  11. Parallel Optimisation

    NSDL National Science Digital Library

    An introduction to optimisation techniques that may improve parallel performance and scaling on HECToR. It assumes that the reader has some experience of parallel programming including basic MPI and OpenMP. Scaling is a measurement of the ability for a parallel code to use increasing numbers of cores efficiently. A scalable application is one that, when the number of processors is increased, performs better by a factor which justifies the additional resource employed. Making a parallel application scale to many thousands of processes requires not only careful attention to the communication, data and work distribution but also to the choice of the algorithms to use. Since the choice of algorithm is too broad a subject and very particular to application domain to include in this brief guide we concentrate on general good practices towards parallel optimisation on HECToR.
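
    The notion of scaling described above is commonly quantified with speedup and parallel efficiency. The following is a minimal sketch, not taken from the guide itself: the function names and the timing values are hypothetical, and the definitions used are the standard strong-scaling ones (speedup = serial time / parallel time, efficiency = speedup / cores).

```python
def speedup(t_serial, t_parallel):
    """Strong-scaling speedup: how many times faster the parallel run is."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, cores):
    """Parallel efficiency: fraction of ideal (linear) speedup achieved."""
    return speedup(t_serial, t_parallel) / cores


def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows from {cores: wall_time}.

    The 1-core entry is used as the serial baseline.
    """
    base = timings[1]
    return [(p, speedup(base, t), efficiency(base, t, p))
            for p, t in sorted(timings.items())]


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, for illustration only.
    measured = {1: 100.0, 2: 52.0, 4: 28.0, 8: 17.0}
    for cores, s, e in scaling_table(measured):
        print(f"{cores:3d} cores  speedup {s:5.2f}  efficiency {e:4.2f}")
```

    A falling efficiency column is one simple signal that the extra cores no longer justify the additional resource employed.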

  12. An FPGA Based SHA256 Processor

    Microsoft Academic Search

    Kurt K. Ting; Steve C. L. Yuen; Kin-hong Lee; Philip Heng Wai Leong

    2002-01-01

    The design, implementation and system level performance of an efficient yet compact field programmable gate array (FPGA) based Secure Hash Algorithm 256 (SHA-256) processor is presented. On a Xilinx Virtex XCV300E-8 FPGA, the SHA-256 processor utilizes 1261 slices and has a throughput of 87 MB/s at 88 MHz. When measured on actual hardware operating at 66 MHz, it had a

  13. Space-efficient scheduling of nested parallelism

    Microsoft Academic Search

    Girija J. Narlikar; Guy E. Blelloch

    1999-01-01

    Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is

  14. An Approach To Portable Parallel Programs

    Microsoft Academic Search

    Charles C. Weems Jr

    1992-01-01

    Parallel architectures vary greatly in their organizations. These differences arise naturally from designing machines to fit different problem domains, and from different physical and cost constraints. Thus, the world is, and will continue to be, populated with parallel processors having significantly different organizations and hence, incompatible programming models: a program written specifically for one parallel processor does not transport directly

  15. Efficient design space exploration of high performance embedded out-of-order processors

    Microsoft Academic Search

    Stijn Eyerman; Lieven Eeckhout; Koen De Bosschere

    2006-01-01

    Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies which

  16. A neural net implementation of SPCA pre-processor for gas/odor classification using the responses of thick film gas sensor array

    Microsoft Academic Search

    N. S. Rajput; R. R. Das; V. N. Mishra; K. P. Singh; R. Dwivedi

    2010-01-01

    In this paper, an artificial neural net (ANN) implementation of SPCA pre-processing is presented for its use with a neural classifier trained with SPCA transformed data. Here, a SPCA transforming neural stage (Net ISPCA) is placed before a SPCA trained neural classifier stage (Net IISPCA). Accordingly, newer sensor array response of respective gas/odor can now be classified, more precisely, using

  17. Parallel I/O Systems

    NSDL National Science Digital Library

    Amy Apon

    Topics: redundant disk array architectures; fault tolerance issues in parallel I/O systems; caching and prefetching; parallel file systems; parallel I/O systems; parallel I/O programming paradigms; parallel I/O applications and environments; parallel programming with parallel I/O

  18. Parallel processing ITS

    SciTech Connect

    Fan, W.C.; Halbleib, J.A. Sr.

    1996-09-01

    This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.

  19. Parallelizing Monte Carlo with PMC

    SciTech Connect

    Rathkopf, J.A.; Jones, T.R.; Nessett, D.M.; Stanberry, L.C.

    1994-11-01

    PMC (Parallel Monte Carlo) is a system of generic interface routines that allows easy porting of Monte Carlo packages of large-scale physics simulation codes to Massively Parallel Processor (MPP) computers. By loading various versions of PMC, simulation code developers can configure their codes to run in several modes: serial, Monte Carlo runs on the same processor as the rest of the code; parallel, Monte Carlo runs in parallel across many processors of the MPP with the rest of the code running on other MPP processor(s); distributed, Monte Carlo runs in parallel across many processors of the MPP with the rest of the code running on a different machine. This multi-mode approach allows maintenance of a single simulation code source regardless of the target machine. PMC handles passing of messages between nodes on the MPP, passing of messages between a different machine and the MPP, distributing work between nodes, and providing independent, reproducible sequences of random numbers. Several production codes have been parallelized under the PMC system. Excellent parallel efficiency in both the distributed and parallel modes results if sufficient workload is available per processor. Experiences with a Monte Carlo photonics demonstration code and a Monte Carlo neutronics package are described.

  20. Algorithmically Specialized Parallel Architecture For Robotics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    Computing system called Robot Mathematics Processor (RMP) contains large number of processor elements (PE's) connected in various parallel and serial combinations reconfigurable via software. Special-purpose architecture designed for solving diverse computational problems in robot control, simulation, trajectory generation, workspace analysis, and like. System an MIMD-SIMD parallel architecture capable of exploiting parallelism in different forms and at several computational levels. Major advantage lies in design of cells, which provides flexibility and reconfigurability superior to previous SIMD processors.

  1. A 0.8-μm CMOS two-dimensional programmable mixed-signal focal-plane array processor with on-chip binary imaging and instructions storage

    Microsoft Academic Search

    R. Dominguez-Castro; S. Espejo; A. Rodriguez-Vazquez; R. A. Carmona; P. Foldesy; A. Zarandy; P. Szolgay; T. Sziranyi; T. Roska

    1997-01-01

    This paper presents a CMOS chip for the parallel acquisition and concurrent analog processing of two-dimensional (2-D) binary images. Its processing function is determined by a reduced set of 19 analog coefficients whose values are programmable with 7-b accuracy. The internal programming signals are analog, but the external control interface is fully digital. On-chip nonlinear digital-to-analog converters (DAC's) map digitally

  2. Efficient local memory sequence generation for data parallel programs using permutations

    Microsoft Academic Search

    Tsung-chuan Huang; Liang-cheng Shiu; Jui-hsiang Huang

    2001-01-01

    Generating local memory access sequence is a critical issue in distributed-memory implementations of data-parallel languages. In this paper, for arrays distributed block-cyclically on multiple processors, we introduce a novel approach to the local memory access sequence generation using the theory of permutation. By compressing the active elements in a block into an integer, called compress number, and exploiting the fact

  3. Graph-Based Dynamic Assignment Of Multiple Processors

    NASA Technical Reports Server (NTRS)

    Hayes, Paul J.; Andrews, Asa M.

    1994-01-01

    Algorithm-to-architecture mapping model (ATAMM) is strategy minimizing time needed to periodically execute graphically described, data-driven application algorithm on multiple data processors. Implemented as operating system managing flow of data and dynamically assigning nodes of graph to processors. Predicts throughput versus number of processors available to execute given application algorithm. Includes rules ensuring application algorithm represented by graph executed periodically without deadlock and in shortest possible repetition time. ATAMM proves useful in maximizing effectiveness of parallel computing systems.

  4. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm^3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm^2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
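
    To make the mapping problem concrete, the sketch below handles only its simplest form: a chain of m modules is split into at most n contiguous blocks, one per processor of a linear array, so that the heaviest block (the pipeline bottleneck) is as light as possible. This is not the algorithm from the paper. It is a well-known greedy probe combined with binary search, it assumes positive integer module weights, and it ignores communication costs. All names are hypothetical.

```python
def blocks_needed(weights, cap):
    """Greedy probe: fewest contiguous blocks whose sums stay <= cap.

    Assumes cap >= max(weights), so every module fits in some block.
    """
    blocks, load = 1, 0
    for w in weights:
        if load + w > cap:
            blocks += 1      # start a new block on the next processor
            load = w
        else:
            load += w
    return blocks


def min_bottleneck(weights, n_procs):
    """Smallest achievable maximum block load using at most n_procs blocks."""
    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if blocks_needed(weights, mid) <= n_procs:
            hi = mid         # feasible, try a tighter bottleneck
        else:
            lo = mid + 1     # infeasible, relax the bottleneck
    return lo


if __name__ == "__main__":
    module_costs = [4, 2, 7, 1, 3, 5]   # hypothetical per-module work
    print(min_bottleneck(module_costs, 3))
```

    The probe runs in O(m) time and the search needs O(log(sum of weights)) probes, which is why this formulation is a common baseline for chain-mapping problems.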

  5. Scioto: A Framework for Global-View Task Parallelism

    SciTech Connect

    Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

    2008-09-09

    We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality-aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.

  6. Model Checking Reconfigurable Processor Configurations for Safety Properties

    E-print Network

    Kapur, Deepak

    Model Checking Reconfigurable Processor Configurations for Safety Properties John Cochran, Deepak@cs.unm.edu, WWW home page: http://www.cs.unm.edu/~slugboy/index.html Abstract. Reconfigurable processors pose of memory access by the reconfigurable array and memory ac- cess bounds checking. 1 Introduction

  7. Run-time Assignment of Tasks to Multiple Heterogeneous Processors

    E-print Network

    Al Hanbali, Ahmad

    : · General Purpose Processor (GPP), e.g. ARM [Fig. 1. SoC Template and the Mapping of a Process Graph] · Digital Signal Processor (DSP [5] · Field Programmable Gate Array (FPGA), e.g. embedded FPGA's The best of both worlds (energy

  8. Performance Comparison of Graphics Processors to Reconfigurable Logic

    E-print Network

    Luk, Wayne

    -purpose processor (GPP). An FPGA is a device based on reconfigurable logic fabric. This work quantifies array (FPGA). Two orders of magnitude speedup, over a general-purpose processor, is observed for each device for arithmetic intensive algorithms. An FPGA is superior, over a GPU, for algorithms requiring

  9. A Multithreaded Soft Processor for SoPC Area Reduction

    E-print Network

    Brown, Stephen Dean

    ,davor,zvonko,brown}@eecg.toronto.edu ABSTRACT The growth in size and performance of Field Programmable Gate Arrays (FPGAs) has compelled System-on-a-Programmable-Chip (SoPC) designers to use soft processors for controlling systems with large numbers of intellectual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor

  10. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad (Rochester, MN)

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  11. Asymmetric MultiProcessing Mobile Application Processor MP211

    Microsoft Academic Search

    Sunao TORII; Junji SAKAI; Hiroaki INOUE; Tatsuya TOKUE; Yoshiyuki ITO

    We propose several techniques of Asymmetric Multi-processing (AMP) for mobile application processors. In our AMP architecture, multiple general purpose processors are integrated on a chip to reduce hardware development period and cost. Our techniques carefully consider the roles of hardware and software to maximize the benefits of parallel processing. The hardware is carefully designed to enlarge bus bandwidth, shorten

  12. An efficient microcode compiler for application specific DSP processors

    Microsoft Academic Search

    Gert Goossens; Jan M. Rabaey; Joos Vandewalle; Hugo De Man

    1990-01-01

    A computer program for microcode compilation for custom digital signal processors is presented. This tool is part of the CATHEDRAL II silicon compiler. The following optimization problems are highlighted: scheduling, hardware assignment, and loop folding. Efficient techniques to solve these problems are developed. This allows for the automatic synthesis of processor architectures which simultaneously exploit pipelining and parallelism. A demonstrator

  13. Architectures for online error detection and recovery in multicore processors

    Microsoft Academic Search

    Dimitris Gizopoulos; Mihalis Psarakis; Sarita V. Adve; Pradeep Ramachandran; Siva Kumar Sastry Hari; Daniel Sorin; Albert Meixner; Arijit Biswas; Xavier Vera

    2011-01-01

    The huge investment in the design and production of multicore processors may be put at risk because the emerging highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chips. Extremely complex, massively parallel, multi-core processor chips fabricated in these technologies will become more vulnerable to: (a) environmental disturbances that produce transient (or

  14. Building the 4 Processor SB-PRAM Prototype

    Microsoft Academic Search

    Peter Bach; Michael Braun; Arno Formella; Jörg Friedrich; Thomas Griin; Cédric Lichtenau

    1997-01-01

    The SB-PRAM is a massively parallel, uniform memory access (UMA) shared memory computer. The main ideas of the design are multithreading on instruction level, hashing of the address space, and combining in the butterfly network. We have built a first research prototype with 4 physical processors, thus 128 virtual processors, to demonstrate the feasibility of the concept. The

  15. Customization of application specific heterogeneous multi-pipeline processors

    Microsoft Academic Search

    Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

    2006-01-01

    In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the ap-

  16. Architecture and data processing alternatives for the TSE computer. Volume 3: Execution of a parallel counting algorithm using array logic (Tse) devices

    NASA Technical Reports Server (NTRS)

    Metcalfe, A. G.; Bodenheimer, R. E.

    1976-01-01

    A parallel algorithm for counting the number of logic-1 elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
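
    The report's combinational structure is not reproduced in this abstract, so the sketch below only illustrates the general idea of counting set elements in parallel. It uses a pairwise tree reduction, in which every addition within one level is independent and could run concurrently. This is an assumed, generic technique, not the Tse algorithm, and the function name is hypothetical.

```python
def count_ones_tree(bits):
    """Count 1-elements with a pairwise (tree) reduction.

    Each pass halves the list by adding neighbouring pairs; the additions
    within a pass do not depend on each other, so a parallel machine could
    perform each pass in one step, giving about log2(N) steps in total.
    """
    level = list(bits)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)          # pad so every element has a partner
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    image = [[1, 0, 1],
             [0, 1, 1],
             [1, 0, 0]]              # hypothetical 3x3 binary image
    flat = [pixel for row in image for pixel in row]
    print(count_ones_tree(flat))
```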

  17. Transitive closure on the imagine stream processor

    SciTech Connect

    Griem, Gorden; Oliker, Leonid

    2003-11-11

    The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best-suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that limited performance of TC is achieved primarily due to the complicated data-dependencies of the blocked algorithm. This work is an ongoing effort to identify classes of scientific problems well-suited for streaming processors.
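
    For orientation, the sketch below shows a generic tiled (blocked) form of Warshall's transitive closure algorithm alongside the plain version. It is an assumption-laden illustration of the blocking idea only: it is not the Imagine stream implementation, it contains no stream or cluster programming, and the function names and tile size are hypothetical.

```python
def warshall(adj):
    """Plain Warshall transitive closure on a boolean adjacency matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]
    _update(reach, range(n), range(n), range(n))
    return reach


def _update(reach, rows, cols, mids):
    """Add paths i -> k -> j for i in rows, j in cols, k in mids."""
    for k in mids:
        for i in rows:
            if reach[i][k]:
                for j in cols:
                    if reach[k][j]:
                        reach[i][j] = True


def tiled_closure(adj, tile=2):
    """Blocked Warshall: process one diagonal tile of intermediates at a time."""
    n = len(adj)
    reach = [row[:] for row in adj]
    tiles = [range(s, min(s + tile, n)) for s in range(0, n, tile)]
    for kt in tiles:
        _update(reach, kt, kt, kt)               # phase 1: diagonal tile
        for t in tiles:
            if t != kt:
                _update(reach, kt, t, kt)        # phase 2: tiles in row kt
                _update(reach, t, kt, kt)        # phase 2: tiles in column kt
        for rt in tiles:
            if rt == kt:
                continue
            for ct in tiles:
                if ct != kt:
                    _update(reach, rt, ct, kt)   # phase 3: remaining tiles
    return reach


if __name__ == "__main__":
    n = 5
    edges = [(0, 1), (1, 2), (2, 3), (3, 4)]     # hypothetical chain graph
    adj = [[(i, j) in edges for j in range(n)] for i in range(n)]
    print(tiled_closure(adj) == warshall(adj))
```

    The phase 2 and phase 3 updates depend on tiles finished in earlier phases, which is one way to picture the data dependencies of the blocked algorithm that the abstract cites as limiting performance.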

  18. Application-specific Processor Architecture: Then and Now

    Microsoft Academic Search

    Peter R. Cappello

    2008-01-01

    We first relate the architecture of systolic arrays to the technological and economic design forces acting on architects of special-purpose systems some 20 years ago. We then observe that those same design forces now are bearing down on the architects of contemporary general-purpose processors, who consequently are producing general-purpose processors whose architectural features are increasingly similar to those of systolic arrays.

  19. Machine-Description Driven Compilers for EPIC and VLIW Processors

    Microsoft Academic Search

    B. Ramakrishna Rau; Vinod Kathail; Shail Aditya

    1999-01-01

    In the past, due to the restricted gate count available on an inexpensive chip, embedded DSPs have had limited parallelism, few registers and irregular, incomplete interconnectivity. More recently, with increasing levels of integration, embedded VLIW processors have started to appear. Such processors typically have higher levels of instruction-level parallelism, more registers, and a relatively regular interconnect between the registers and

  20. Design of a multithreaded instruction cache for a hyperscalar processor

    E-print Network

    Rajagopal, Arjun

    1993-01-01

    . Hyperscalar . B. Multithreaded processors . 1. Why multithreading . C. HISS the Hyperscalar architecture D. Multithreaded cache organization E. Pipelining of cache accesses 1. Lock up free cache access ... rate. The available instruction level parallelism for general purpose applications is around 2.5-3 instructions/cycle [4]. There are three approaches to take advantage of instruction level parallelism to improve performance. Superscalar processors...

  1. Architectures for reasoning in parallel

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.

    1989-01-01

    The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or fewer processors. Additionally, experiments have been run with a stripped-down version of EMYCIN.
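
    The future construct described above has a rough analogue in Python's standard concurrent.futures module, where submitting a function to an executor returns a Future whose result is collected later. The sketch below uses it for one simple style of forward chaining, in which every rule is tested as a separate task against a snapshot of the known facts. It is an illustrative assumption, not Datapac's code; the rule format, rule set, and function names are hypothetical, and Python threads are used here to show the structure rather than to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rules: (set of premises, conclusion).
RULES = [
    ({"a", "b"}, "c"),
    ({"c"}, "d"),
    ({"x"}, "y"),
]


def try_rule(rule, facts):
    """Return the rule's conclusion if all its premises are known, else None."""
    premises, conclusion = rule
    return conclusion if premises <= facts else None


def forward_chain(rules, facts, workers=4):
    """Fire rules in parallel rounds until no new fact is derived."""
    facts = set(facts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            snapshot = frozenset(facts)
            # Each submit() spawns a task and returns a Future immediately.
            futures = [pool.submit(try_rule, rule, snapshot) for rule in rules]
            derived = {f.result() for f in futures} - {None}
            if derived <= facts:
                return facts
            facts |= derived


if __name__ == "__main__":
    print(sorted(forward_chain(RULES, {"a", "b"})))
```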

  2. COMPUTER ARCHITECTURE WITH ASSOCIATIVE PROCESSOR REPLACING LAST

    E-print Network

    Ginosar, Ran

    1 INTRODUCTION Machine learning, data mining, network routing, search engines and other big data data storage and processing, and functions as a parallel SIMD processor and a memory at the same time architectures include vector, or SIMD coprocessors [1][16][24]. However data transfer between

  3. Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System

    Microsoft Academic Search

    Yuan-chieh Chow; Walter H. Kohler

    1979-01-01

    Queueing models for a simple heterogeneous multiple processor system are presented, analyzed, and compared. Each model is distinguished by a job routing strategy which is designed to reduce the average job turnaround time by balancing the total load among the processors. In each case an arriving job is routed by a job dispatcher to one of m parallel processors. The

  4. Assembly Code Conversion of Software-Pipelined Loop between two VLIW DSP Processors

    E-print Network

    Su, Bogong

    Assembly Code Conversion of Software-Pipelined Loop between two VLIW DSP Processors Bogong Su 1 the instruction level parallelism of VLIW DSP processors, DSP programs have to be optimized by software pipelining of the target machine to obtain the optimized assembly code of the target DSP processor We have conducted

  5. Doppler-free, Multi-wavelength Acousto-optic deflector for two-photon addressing arrays of Rb atoms in a Quantum Information Processor

    E-print Network

    Sangtaek Kim; Robert R. Mcleod; Mark Saffman; Kelvin H. Wagner

    2007-11-21

    We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 nm and 480 nm) and have non-overlapping Bragg-matched frequency response at these wavelengths, so that there will be no crosstalk when proportional RF frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a Tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated bandshapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 nm and 480 nm), and a 4 usec or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD.

  6. Doppler-free, Multi-wavelength Acousto-optic deflector for two-photon addressing arrays of Rb atoms in a Quantum Information Processor

    E-print Network

    Kim, Sangtaek; Saffman, Mark; Wagner, Kelvin H

    2007-01-01

    We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 nm and 480 nm) and have non-overlapping Bragg-matched frequency response at these wavelengths, so that there will be no crosstalk when proportional RF frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a Tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated bandshapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 nm and 480 nm), and a 4 u...

  7. Doppler-free, multiwavelength acousto-optic deflector for two-photon addressing arrays of Rb atoms in a quantum information processor.

    PubMed

    Kim, Sangtaek; Mcleod, Robert R; Saffman, M; Wagner, Kelvin H

    2008-04-10

    We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb87 atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 and 480 nm) and have nonoverlapping Bragg-matched frequency response at these wavelengths, so that there will be no cross talk when proportional frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated band shapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 and 480 nm), and a 4 μs or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD. PMID:18404181

  8. A Parallel Differential Evolution Algorithm

    Microsoft Academic Search

    Wojciech Kwedlo; Krzysztof Bandurski

    2006-01-01

    In the paper the problem of using a differential evolution algorithm for feed-forward neural network training is considered. A new parallelization scheme for the computation of the fitness function is proposed. This scheme is based on data decomposition. Both the learning set and the population of the evolutionary algorithm are distributed among processors. The processors form a pipeline using the

  9. Parallel Computations on Reconfigurable Meshes

    Microsoft Academic Search

    Russ Miller; Viktor K. Prasanna; Dionisios I. Reisis; Quentin F. Stout

    1993-01-01

    The mesh with reconfigurable bus is presented as a model of computation. The reconfigurable mesh captures salient features from a variety of sources, including the CAAPP, CHiP, polymorphic-torus network, and bus automation. It consists of an array of processors interconnected by a reconfigurable bus system that can be used to dynamically obtain various interconnection patterns between the processors. A variety

  10. Model-driven mapping onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The author addresses the problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine in the context of building a mapping compiler for a distributed memory parallel machine. He demonstrates the effectiveness of using execution models to select the best mapping technique from among those available for a given program segment on a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs on a linear processor array, it is shown that selecting the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from a mapping compiler for the Warp systolic array machine show that the execution models considered are accurate enough to select the best mapping technique for a given program.

  11. Parallel Algorithms for Computer Vision on the Connection Machine

    E-print Network

    Little, James J.

    1986-11-01

    The Connection Machine is a fine-grained parallel computer having up to 64K processors. It supports both local communication among the processors, which are situated in a two-dimensional mesh, and high-bandwidth ...

  12. Pthreads for dynamic and irregular parallelism

    Microsoft Academic Search

    Girija J. Narlikar; Guy E. Blelloch

    1998-01-01

    High performance applications on shared memory machines have typically been written in a coarse grained style, with one heavyweight thread per processor. In comparison, programming with a large number of lightweight, parallel threads has several advantages, including simpler coding for programs with irregular and dynamic parallelism, and better adaptability to a changing number of processors. The programmer can express a

  13. Parallel methods for the flight simulation model

    Microsoft Academic Search

    Wei Zhong Xiong; C. Swietlik

    1994-01-01

    The Advanced Computer Applications Center (ACAC) has been involved in evaluating advanced parallel architecture computers and the applicability of these machines to computer simulation models. The advanced systems investigated include parallel machines with shared memory and distributed architectures consisting of an eight processor Alliant FX/8, a twenty four processor Sequent Symmetry, Cray XMP, IBM RISC 6000 model 550, and

  14. Data parallel sequential circuit fault simulation

    Microsoft Academic Search

    Minesh B. Amin; Bapiraju Vinnakota

    1996-01-01

    Sequential circuit fault simulation is a compute-intensive problem. Parallel simulation is one method to reduce fault simulation time. In this paper, we discuss a novel technique to partition the fault set for the fault parallel simulation of sequential circuits on multiple processors. When applied statically, the technique can scale well for up to thirty two processors on an ethernet. The

  15. Virtual Reality and Parallel Systems Performance Analysis

    Microsoft Academic Search

    Daniel A. Reed; Keith A. Shields; Will H. Scullin; Luis F. Tawera; Christopher L. Elford

    1995-01-01

    Recording and analyzing the dynamics of application program, system software, and hardware interactions are the keys to understanding and tuning the performance of massively parallel systems. Because massively parallel systems contain hundreds or thousands of processors, each potentially with many dynamic performance metrics, the performance data occupy a sparsely populated, high-dimensional space. These dynamic performance metrics for each processor define

  16. Sandia secure processor : a native Java processor.

    SciTech Connect

    Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

    2003-08-01

    The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP's design is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

  17. The ARPA-MT Embedded SMT Processor and Its RTOS Hardware Accelerator

    Microsoft Academic Search

    Arnaldo S. R. Oliveira; Luís Almeida; António de Brito Ferrari

    2011-01-01

    The high-level modeling and parameterization capabilities of current hardware description languages, as well as the huge integration capacity and flexibility provided by modern field-programmable gate arrays (FPGAs), open the way to designing processors tuned to given applications and favoring specific properties. This paper presents the Advanced Real-time Processor Architecture (ARPA) MultiThreaded processor, a customizable, synthesizable, and time-predictable processor

  18. Parallel raster image processor for PCB manufacturing

    Microsoft Academic Search

    J. L. Martin; G. Aranguren; J. Ezquerra; P. Ibañez; R. Lasure; J. Van Campenhout

    1994-01-01

    The printed circuit board (PCB) is the base most commonly used for building electronic circuits. The design of the PCB is usually done by means of an adequate CAD program. The files generated by CAD programs can have different formats to describe printed circuit boards, but all formats essentially consist of the traces and the soldering pads that make up

  19. Master/slave speculative parallelization

    Microsoft Academic Search

    Craig B. Zilles; Gurindar S. Sohi

    2002-01-01

    Master/Slave Speculative Parallelization (MSSP) is an execution paradigm for improving the execution rate of sequential programs by parallelizing them speculatively for execution on a multiprocessor. In MSSP, one processor (the master) executes an approximate version of the program to compute selected values that the full program's execution is expected to compute. The master's results are checked by slave processors that execute the

  20. A low power front-end for embedded processors using a block-aware instruction set

    Microsoft Academic Search

    Ahmad Zmily; Christos Kozyrakis

    2007-01-01

    Energy, power, and area efficiency are critical design concerns for embedded processors. Much of the energy of a typical embedded processor is consumed in the front-end since instruction fetching happens on nearly every cycle and involves accesses to large memory arrays such as instruction and branch target caches. The use of small front-end arrays leads to significant power and area

  1. Appears in the IEEE Transactions on Parallel and Distributed Systems, June 95 Integer Programming for Array Subscript Analysis

    E-print Network

    Subhlok, Jaspal

    for Array Subscript Analysis Jaspal Subhlok School of Computer Science, Carnegie Mellon University analysis This research was sponsored in part by the Advanced Research Projects Agency/CSTO monitored, Pittsburgh PA 15213 Ken Kennedy Department of Computer Science, Rice University, Houston, TX 77251 Abstract

  2. High Density 3-D Integration Technology for Massively Parallel Signal Processing in Advanced Infrared Focal Plane Array Sensors

    Microsoft Academic Search

    D. Temple; C. A. Bower; D. Malta; J. E. Robinson; P. R. Coffman; M. R. Skokan; T. B. Welch

    2006-01-01

    The paper describes a platform technology for three-dimensional (3-D) integration of multiple layers of silicon integrated circuits. The technology promises to dramatically enhance on-chip signal processing capabilities of a variety of sensor and actuator devices hybridized with Si electronics. Among these applications are high performance infrared focal plane array detectors

  3. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. 
    When the program enters a parallel statement, either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference: Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195.
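
    The default even distribution of processors among tasks, or of tasks among processors, can be sketched as follows. This is a minimal Python illustration, not PM code; the names split_evenly and distribute and the list-based representation are assumptions made for the example. The abstract does not say how equal counts are handled, so the sketch treats that case as one processor per task.

```python
def split_evenly(items, parts):
    """Split items into `parts` contiguous groups whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    groups, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups


def distribute(processors, tasks):
    """Return (mode, mapping) for one parallel statement (hypothetical helper)."""
    if not processors or not tasks:
        raise ValueError("need at least one processor and one task")
    if len(tasks) <= len(processors):
        # Each new task owns a subset of the parent's processors.
        groups = split_evenly(processors, len(tasks))
        return "processors_per_task", dict(zip(tasks, groups))
    # More tasks than processors: each processor receives a share of the tasks.
    shares = split_evenly(tasks, len(processors))
    return "tasks_per_processor", dict(zip(processors, shares))
```

    A nested parallel statement would call distribute again on the processor subset owned by the enclosing task.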

  4. An Optimizing Algorithm for Extended CAM Processors with Threshold Search

    Microsoft Academic Search

    Takao TOTSUKA; Yuichiro MIYAOKA; Yuichiro ISHIKAWA; Nozomu TOGAWA; Masao YANAGISAWA; Tatsuo OHTSUKI

    An extended content addressable memory (CAM) realizes not only conventional equivalent search but parallel threshold search such as less-than search and greater-than search. In order to use the parallel processing function of CAM, parallel processing circuits are needed around a CAM cell array. Furthermore every application requires its specific CAM cell array and peripheral circuits. This paper proposes an optimizing
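
    The search modes named above can be modelled in software. The following is a minimal sketch in Python of what an extended CAM computes, not of the optimizing algorithm the paper proposes; the function cam_search and the mode labels are assumptions made for the example.

```python
import operator

# Search modes named in the abstract; the dictionary keys are labels chosen here.
MODES = {
    "equal": operator.eq,
    "less_than": operator.lt,
    "greater_than": operator.gt,
}


def cam_search(cells, key, mode="equal"):
    """Return one match flag per stored word, like the match lines of a CAM."""
    compare = MODES[mode]
    # In hardware every cell compares against the key at the same time;
    # this comprehension models only the result, not the timing.
    return [compare(word, key) for word in cells]


cells = [7, 3, 12, 7, 9]
equal_hits = cam_search(cells, 7)
below_nine = cam_search(cells, 9, "less_than")
```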

  5. An Experimental Digital Image Processor

    NASA Astrophysics Data System (ADS)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
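
    Unsharp masking, the edge enhancement method named above, can be illustrated on a single image row. The following is a minimal sketch in Python of the general technique (original plus a scaled difference between the original and a blurred copy); the box blur, the amount and radius parameters, and the function names are assumptions and do not describe the Kodak hardware.

```python
def box_blur(signal, radius=1):
    """Mean filter over a (2 * radius + 1)-sample window with edge replication."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask(signal, amount=1.0, radius=1):
    """Add back the detail removed by blurring, scaled by `amount`."""
    blurred = box_blur(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]


row = [0.0, 0.0, 0.0, 9.0, 9.0, 9.0]  # a step edge
sharpened = unsharp_mask(row)
```

    On the step edge the result undershoots on the dark side and overshoots on the bright side, which is what makes the edge appear sharper.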

  6. SPROC: A multiple-processor DSP IC

    NASA Technical Reports Server (NTRS)

    Davis, R.

    1991-01-01

    A large, single-chip, multiple-processor, digital signal processing (DSP) integrated circuit (IC) fabricated in HP-Cmos34 is presented. The innovative architecture is best suited for analog and real-time systems characterized by both parallel signal data flows and concurrent logic processing. The IC is supported by a powerful development system that transforms graphical signal flow graphs into production-ready systems in minutes. Automatic compiler partitioning of tasks among four on-chip processors gives the IC the signal processing power of several conventional DSP chips.

  7. A Parallel Tree Code

    E-print Network

    John Dubinski

    1996-03-18

    We describe a new implementation of a parallel N-body tree code. The code is load-balanced using the method of orthogonal recursive bisection to subdivide the N-body system into independent rectangular volumes each of which is mapped to a processor on a parallel computer. On the Cray T3D, the load balance is in the range of 70-90% depending on the problem size and number of processors. The code can handle simulations with more than 10 million particles, roughly a factor of 10 greater than allowed in vectorized tree codes.
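
    Orthogonal recursive bisection can be sketched in a few lines. The following Python illustration is a simplified version under stated assumptions: it balances particle counts rather than measured work, cycles through the coordinate axes, and requires the number of domains to be a power of two and no larger than the number of particles. The function name orb is hypothetical and the sketch is not the paper's implementation.

```python
def orb(points, n_domains, axis=0):
    """Recursively bisect `points` (coordinate tuples) into n_domains groups.

    Assumes n_domains is a power of two and len(points) >= n_domains.
    """
    if n_domains == 1:
        return [points]
    # Cut at the median along the current axis so both halves hold
    # (nearly) the same number of particles.
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(ordered) // 2
    next_axis = (axis + 1) % len(points[0])
    return (orb(ordered[:half], n_domains // 2, next_axis)
            + orb(ordered[half:], n_domains // 2, next_axis))


particles = [(0, 0), (1, 5), (2, 1), (3, 4), (4, 2), (5, 3), (6, 7), (7, 6)]
domains = orb(particles, 4)
```

    Each returned group corresponds to one rectangular volume that would be assigned to one processor.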

  8. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, Michael S. (Ridgefield, CT); Strip, David R. (Albuquerque, NM)

    1996-01-01

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.

  9. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, M.S.; Strip, D.R.

    1996-01-30

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

  10. Fault tolerance techniques for systolic arrays

    SciTech Connect

    Abraham, J.A.; Banerjee, P.; Chen, C.Y.; Fuchs, W.K.; Kua, S.Y.; Reddy, A.L.N. (Univ. of Illinois)

    1987-07-01

    Digital systems that are operated in applications where there is a high cost of failure require high reliability and continuous operation. Since it is impossible to guarantee that portions of a system will never fail, such systems need to be designed to tolerate failures of the system components. The discipline of fault-tolerant computing is, therefore, one which has attracted a great deal of research interest. Researchers have attempted to derive highly effective and, at the same time, efficient techniques to tolerate failures in complex digital systems. The high computation needs of many applications can now be met through the use of highly parallel special-purpose systems that can be produced very cost effectively through the use of very large scale integration (VLSI) technology. Systolic arrays, such as the ESL systolic array and the Carnegie Mellon Warp processor, are examples of such systems.

  11. Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications

    Microsoft Academic Search

    Manoj Kumar

    1988-01-01

    Describes COMET (concurrency measurement tool), a software tool for measuring parallelism in large scientific/engineering applications. The proposed tool measures the total parallelism present in programs, filtering out the effects of communication/synchronization delays, finite storage, limited number of processors, the policies for management of processors and storage, etc. Although an ideal machine that can exploit the total parallelism is not realizable,

  12. A scheme for handling arrays in data-flow systems

    NASA Technical Reports Server (NTRS)

    Gaudiot, J.-L.; Ercegovac, M. D.

    1982-01-01

    An examination of the effects of atomicity (higher resolution) on the performance of array processors (data-flow computers) is presented. Data-flow principles are reviewed, noting the reliance on parallel processing using functional languages to specify sequencing of the operations. Techniques are described for eliminating the necessity of copying whole arrays between processing steps, thereby reducing the number of store cycles. The method involves setting whole columns to specific values rather than individual elements. The individual column values can be processed in parallel, i.e., a locally optimized condition exists. A drawback of the system is the need for more low level arguments, to identify the appropriate processing sequences, and high system complexity.

  13. Dynamic parallel complexity of computational circuits

    Microsoft Academic Search

    Gary L. Miller; Shang-Hua Teng

    1987-01-01

    The dynamic parallel complexity of general computational circuits (defined in introduction) is discussed. We exhibit some relationships between parallel circuit evaluation and some uniform closure properties of a certain class of unary functions and present a systematic method for the design of processor efficient parallel algorithms for circuit evaluation. Using this method: (1) we improve the algorithm for parallel Boolean

  14. Integration of micro-optics with a fiber array connector using passive alignment technique for parallel optics applications

    Microsoft Academic Search

    Hongtao Han; Jim Morris; Adam Fedor; Bingzhi Su; David Aichele; Eden Chen; Holly Weathersbee; Alexey Semakov

    2004-01-01

    The micro-optic chips are made of glass material using wafer scale photolithography and etching techniques, and micro optical elements are fabricated on both sides. The fiber array connector is an injection molded plastic receptacle, which contains an interface for the MT connector and a precision cavity for passive alignment. For the 12-channel transmitter optical sub-assembly (OSA), we incorporated 12 diffractive

  15. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-03-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

  16. A VHDL-based protocol controller for NCAP processors

    Microsoft Academic Search

    S. R. Rossi; E. D. Moreno; A. A. De Carvalho; A. C. R. Da Silva; E. A. Batista; T. A. Prado; T. A. Santos Filho

    2009-01-01

    This work presents the development of an IEEE 1451.2 protocol controller based on a low-cost FPGA that is directly connected to the parallel port of a conventional personal computer. In this manner it is possible to implement a Network Capable Application Processor (NCAP) based on a personal computer, without parallel port modifications. This approach allows supporting the ten signal lines

  17. A dynamic associative processor for machine vision applications

    Microsoft Academic Search

    Frederick P. Herrmann; C. G. Sodini

    1992-01-01

    The use of massively parallel associative processors as coprocessors for accelerating machine vision applications is considered. They achieve very fine granularity, as every word of memory functions as a simple processing element. A dense, dynamic, content-addressable memory cell supports fully parallel operation, and pitch-matched word logic improves arithmetic performance with minimal area cost. An asynchronous reconfigurable mesh network handles interprocessor

  18. Linear array implementation of the EM algorithm for PET image reconstruction

    SciTech Connect

    Rajan, K.; Patnaik, L.M.; Ramakrishna, J. [Indian Institute of Science, Bangalore (India)]

    1995-08-01

    The PET image reconstruction based on the EM algorithm has several attractive advantages over the conventional convolution back projection algorithms. However, the PET image reconstruction based on the EM algorithm is computationally burdensome for today's single processor systems. In addition, a large memory is required for the storage of the image, projection data, and the probability matrix. Since the computations are easily divided into tasks executable in parallel, multiprocessor configurations are the ideal choice for fast execution of the EM algorithms. In this study, the authors attempt to overcome these two problems by parallelizing the EM algorithm on a multiprocessor system. The parallel EM algorithm on a linear array topology using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) has been implemented. The performance of the EM algorithm on a 386/387 machine, IBM 6000 RISC workstation, and on the linear array system is discussed and compared. The results show that the computational speed performance of a linear array using 8 DSP chips as PEs executing the EM image reconstruction algorithm is about 15.5 times better than that of the IBM 6000 RISC workstation. The novelty of the scheme is its simplicity. The linear array topology is expandable with a larger number of PEs. The architecture is not dependent on the DSP chip chosen, and the substitution of the latest DSP chip is straightforward and could yield better speed performance.
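
    The abstract does not state the update rule or how the work is partitioned across the linear array. As an illustration only, the following Python sketch shows one iteration of the standard maximum-likelihood EM update for emission tomography, written sequentially; the function name mlem_step and the argument layout (prob[i][j] as the probability that an emission in pixel j is detected in bin i) are assumptions made for the example.

```python
def mlem_step(image, projections, prob):
    """One EM iteration: scale each pixel by its back-projected data/model ratio."""
    n_det, n_pix = len(prob), len(image)
    # Forward projection: expected counts in each detector bin.
    forward = [sum(prob[i][j] * image[j] for j in range(n_pix))
               for i in range(n_det)]
    updated = []
    for j in range(n_pix):
        # Sensitivity: total detection probability for pixel j.
        sensitivity = sum(prob[i][j] for i in range(n_det))
        # Back projection of measured/expected ratios, skipping empty bins.
        back = sum(prob[i][j] * projections[i] / forward[i]
                   for i in range(n_det) if forward[i] > 0)
        updated.append(image[j] * back / sensitivity if sensitivity > 0 else 0.0)
    return updated


# Toy case: two pixels, two detector bins, each bin sees exactly one pixel.
estimate = mlem_step([1.0, 1.0], [4.0, 6.0], [[1.0, 0.0], [0.0, 1.0]])
```

    The per-bin forward projections and the per-pixel back projections are independent sums, which is the kind of computation that divides naturally among processing elements.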

  19. Design definition for a digital beamforming processor

    NASA Astrophysics Data System (ADS)

    Langston, J. L.; Sanzgiri, Shashikant; Hinman, Karl; Keisner, Kevin; Garcia, Domingo

    1988-04-01

    Very large scale integrated circuit technology now makes large bandwidth digital beamforming array antennas practical. Algorithms and architectures were investigated for the implementation of a processor capable of producing large bandwidth multiple output beams for both near and far-term applications. Algorithms in element space and beam space were investigated. Structures for dedicated algorithm execution with highly pipelined, systolic hardware were examined. Arithmetic execution alternatives were considered. The impact of channel errors was investigated and methods of calibrating the beamformer to compensate for these errors were developed. The effects of quantization errors were investigated and processor dynamic range requirements were assessed. The capabilities of Si and GaAs technologies were assessed. The implementation of a processor chip set using Application Specific Integrated Circuits (ASIC) was investigated. A recommended brassboard demonstration system design was derived.

  20. Rapid, Single-Molecule Assays in Nano/Micro-Fluidic Chips with Arrays of Closely Spaced Parallel Channels Fabricated by Femtosecond Laser Machining

    PubMed Central

    Canfield, Brian K.; King, Jason K.; Robinson, William N.; Hofmeister, William H.; Davis, Lloyd M.

    2014-01-01

    Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

  1. National Resource for Computation in Chemistry (NRCC). Attached scientific processors for chemical computations: a report to the chemistry community

    SciTech Connect

    Ostlund, N.S.

    1980-01-01

    The demands of chemists for computational resources are well known and have been amply documented. The best and most cost-effective means of providing these resources is still open to discussion, however. This report surveys the field of attached scientific processors (array processors) and attempts to indicate their present and possible future use in computational chemistry. Array processors have the possibility of providing very cost-effective computation. This report attempts to provide information that will assist chemists who might be considering the use of an array processor for their computations. It describes the general ideas and concepts involved in using array processors, the commercial products that are available, and the experiences reported by those currently using them. In surveying the field of array processors, the author makes certain recommendations regarding their use in computational chemistry. 5 figures, 1 table (RWR)

  2. Parallel fault-tolerant robot control

    NASA Astrophysics Data System (ADS)

    Hamilton, Deirdre L.; Bennett, John K.; Walker, Ian D.

    1992-11-01

    Most robot controllers today employ a single processor architecture. As robot control requirements become more complex, these serial controllers have difficulty providing the desired response time. Additionally, with robots being used in environments that are hazardous or inaccessible to humans, fault-tolerant robotic systems are particularly desirable. A uniprocessor control architecture cannot offer tolerance of processor faults. Use of multiple processors for robot control offers two advantages over single processor systems. Parallel control provides a faster response, which in turn allows a finer granularity of control. Processor fault tolerance is also made possible by the existence of multiple processors. There is a trade-off between performance and the level of fault tolerance provided. This paper describes a shared memory multiprocessor robot controller that is capable of providing high performance and processor fault tolerance. We evaluate the performance of this controller, and demonstrate how performance and processor fault tolerance can be balanced in a cost-effective manner.

  3. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.

  4. Parallel VLSI architecture emulation and the organization of APSA/MPP

    NASA Technical Reports Server (NTRS)

    Odonnell, John T.

    1987-01-01

    The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

  5. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
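
    The difference between detecting and correcting errors can be sketched in a few lines. The two-processor case follows the abstract directly; the four-processor case assumes a pair-and-spare arrangement (processors A/B on one comparator, C/D on the other), which is one plausible reading and is not stated in the abstract. All names are hypothetical.

```python
# Sketch of comparator-based redundancy. Inputs are the outputs that
# redundant processors produced for the same computation.

def detect(out_a, out_b):
    """Two processors and one comparator: a mismatch is detected, not corrected."""
    return out_a == out_b


def pair_and_spare(out_a, out_b, out_c, out_d):
    """Four processors and two comparators (assumed pairing: A/B and C/D).

    If one pair disagrees, the result of the agreeing pair is used, so a
    single upset is corrected. If both pairs disagree, the error is only
    detected.
    """
    if out_a == out_b:
        return out_a
    if out_c == out_d:
        return out_c
    raise RuntimeError("both pairs disagree: error detected but not corrected")
```

    With only two processors a mismatch says that something went wrong but not which output is right, which is why the simpler prototype is limited to error detection.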

  6. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.
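
    For readers unfamiliar with the terms, speed-up and parallel efficiency are conventionally defined from wall-clock times as sketched below. The definitions are standard; the function names are illustrative, and no timings from the paper are reproduced here.

```python
# Standard definitions: speed-up S = T_serial / T_parallel, and
# efficiency E = S / n_proc. A speed-up that stops growing as processors
# are added shows up as falling efficiency.

def speedup(t_serial, t_parallel):
    """Ratio of serial run time to parallel run time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_proc):
    """Speed-up per processor; 1.0 means perfect linear scaling."""
    return speedup(t_serial, t_parallel) / n_proc
```

    As a hypothetical illustration, a job that takes 100 time units serially and 25 units on 8 processors has a speed-up of 4 and an efficiency of 0.5; if 16 processors still need 25 units, the speed-up is unchanged and the efficiency halves to 0.25.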

  7. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    SciTech Connect

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  8. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. 
Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon function calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction of the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline, the prototypical non-data-parallel algorithm, can be constructed with ease. To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.
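
    The idea of mapping between a non-distributed array and a distributed one can be sketched with a plain one-dimensional block distribution. The code below is not the Charon API: every function name is hypothetical, and ordinary Python lists stand in for per-processor memory. It only shows the index arithmetic that lets a legacy array and its distributed counterpart coexist and be converted in either direction.

```python
# Sketch of a 1-D block distribution (hypothetical names, not Charon calls).

def block_bounds(n, n_procs, rank):
    """Half-open global index range [start, stop) owned by `rank`."""
    base, extra = divmod(n, n_procs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return start, start + size


def owner_of(n, n_procs, g):
    """Map global index g to (rank, local index)."""
    base, extra = divmod(n, n_procs)
    cutoff = extra * (base + 1)  # indices below this live in the larger blocks
    if g < cutoff:
        return g // (base + 1), g % (base + 1)
    return extra + (g - cutoff) // base, (g - cutoff) % base


def scatter(global_array, n_procs):
    """Non-distributed -> distributed: one local block per processor."""
    n = len(global_array)
    return [global_array[slice(*block_bounds(n, n_procs, r))]
            for r in range(n_procs)]


def gather(local_arrays):
    """Distributed -> non-distributed: concatenate blocks in rank order."""
    return [x for part in local_arrays for x in part]
```

    Calling `scatter` with a single processor returns the whole array as one block, which mirrors the abstract's remark that a non-distributed array can be treated as a trivial distribution.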

  9. A 0.4-V UWB baseband processor

    Microsoft Academic Search

    Vivienne Sze; Anantha P. Chandrakasan

    2007-01-01

    A 0.4-V UWB digital baseband processor has been fabricated in a standard-VT 90-nm CMOS technology. The baseband processor operates at an ultra-low supply voltage to reduce energy consumption and utilizes a highly parallelized architecture to meet throughput constraints. While ultra-low voltage operation is usually limited to low energy, low performance applications, this work examines how it

  10. Compact optical processor for Hough and frequency domain features

    NASA Astrophysics Data System (ADS)

    Ott, Peter

    1996-11-01

    Shape recognition is necessary in a broad range of applications such as traffic sign or work piece recognition. It requires not only neighborhood processing of the input image pixels but global interconnection of them. The Hough transform (HT) performs such a global operation and it is well suited to the preprocessing stage of a shape recognition system. Translation invariant features can be easily calculated from the Hough domain. We have implemented on the computer a neural network shape recognition system which contains a HT, a feature extraction, and a classification layer. The advantage of this approach is that the total system can be optimized with well-known learning techniques and that it can exploit the parallelism of the algorithms. However, the HT is a time consuming operation. Parallel, optical processing is therefore advantageous. Several systems have been proposed, based on space multiplexing with arrays of holograms and CGH's or time multiplexing with acousto-optic processors or by image rotation with incoherent and coherent astigmatic optical processors. We took up the last mentioned approach because 2D array detectors are read out line by line, so a 2D detector can achieve the same speed and is easier to implement. Coherent processing can allow the implementation of filters in the frequency domain. Features based on wedge/ring, Gabor, or wavelet filters have been proven to show good discrimination capabilities for texture and shape recognition. The astigmatic lens system which is derived from the mathematical formulation of the HT is long and contains a non-standard, astigmatic element. By methods of lens transformations for coherent applications we map the original design to a shorter lens with a smaller number of well separated standard elements and with the same coherent system response. The final lens design still contains the frequency plane for filtering and ray-tracing shows diffraction limited performance. 
Image rotation can be done optically by a rotating prism. We realize it on a fast FLC-SLM of our lab as input device. The filters can be implemented on the same type of SLM with 128 by 128 square pixels of size, resulting in a total length of the lens of less than 50cm.
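
    The global operation that makes the HT expensive in software can be seen in a minimal digital sketch: every edge pixel votes once per angle, so the cost grows with the number of pixels times the number of angles. This is a textbook line Hough transform in the (theta, rho) parameterization, not the optical system described above; the function names and the one-degree angular step are assumptions.

```python
import math

# Textbook line Hough transform: rho = x*cos(theta) + y*sin(theta).

def hough_lines(points, width, height, n_theta=180):
    """Accumulate votes in (theta, rho) space for a list of (x, y) edge pixels."""
    diag = int(math.ceil(math.hypot(width, height)))
    # rho lies in [-diag, diag]; shift by diag to get a non-negative index.
    acc = [[0] * (2 * diag + 1) for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[t][rho + diag] += 1
    return acc, diag


def strongest_line(acc, diag):
    """Return (theta in degrees, rho) of the accumulator cell with most votes."""
    n_theta, n_rho = len(acc), len(acc[0])
    t, r = max(((t, r) for t in range(n_theta) for r in range(n_rho)),
               key=lambda tr: acc[tr[0]][tr[1]])
    return 180.0 * t / n_theta, r - diag
```

    For each fixed angle the inner step is a projection of the whole image onto one axis, which is the part an image-rotation-based optical processor performs for a full row of rho values at once.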

  11. Data Parallel Switch-Level Simulation

    E-print Network

    Bryant, Randal E.

    Mellon University Abstract Data parallel simulation involves simulating the behavior of a circuit over runs on a massively-parallel SIMD machine, with each processor simulating the circuit behavior parallelism in simulation utilize circuit parallelism. In this mode, the simulator extracts parallelism from

  12. Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging.

    PubMed

    El-Ghussein, Fadi; Mastanduno, Michael A; Jiang, Shudong; Pogue, Brian W; Paulsen, Keith D

    2014-01-01

    A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR ratio of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

  13. Algorithmic commonalities in the parallel environment

    NASA Technical Reports Server (NTRS)

    Mcanulty, Michael A.; Wainer, Michael S.

    1987-01-01

    The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.

  14. MAP3D: a media processor approach for high-end 3D graphics

    NASA Astrophysics Data System (ADS)

    Darsa, Lucia; Stadnicki, Steven; Basoglu, Chris

    1999-12-01

    Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with high performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline, allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to successfully find and exploit any parallelism in a programmer's code, thus decreasing the time to market of a given application. The ability to run an operating system makes it possible to run concurrent applications in the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved, using real-time decoded imagery for texturing 3D objects, for instance. This novel architecture enables an affordable, integrated solution for high performance 3D graphics.

  15. Array architectures for block matching algorithms

    Microsoft Academic Search

    T. Komarek; P. Pirsch

    1989-01-01

    A description is given of VLSI architectures for block-matching algorithms utilizing systolic array processors. A well-known mapping procedure has been applied to derive the array processors from the algorithm. Examples of two- and one-dimensional systolic arrays are presented. The transistor-count of the architectures using presently available CMOS technology and their maximum processable frame rates for real-time computation of video signals
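
    As background for the architectures discussed above, a plain software version of full-search block matching is sketched below. It uses the sum of absolute differences (SAD) as the matching criterion, which is a common choice but an assumption here, since the truncated abstract does not name the criterion. The function names are hypothetical, and frames are plain nested lists of pixel values.

```python
# Full-search block matching with a sum-of-absolute-differences criterion.

def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy)."""
    return sum(abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(n) for c in range(n))


def full_search(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search]^2 that stays inside `ref`.

    Returns (best cost, dx, dy), or None if no candidate fits in the frame.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

    The four nested loops (two over displacements, two over pixels in the block) have no data dependencies beyond the final accumulation and comparison, which is the regularity that makes the algorithm a natural fit for systolic arrays.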

  16. Configurable Multi-Purpose Processor

    NASA Technical Reports Server (NTRS)

    Valencia, J. Emilio; Forney, Chirstopher; Morrison, Robert; Birr, Richard

    2010-01-01

    Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, improved launch and landing opportunities, etc. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides for a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors that run the VxWorks real-time operating system and are used to execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA's processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle's flight termination system. Consequently, the major functions of the CTP are to perform the forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, and data encryption/decryption, and to initiate flight termination action commands. 
Also, the CTP had to be designed to survive and operate in a launch environment. Additionally, the CTP was designed to interface with the WFF (Wallops Flight Facility) custom-designed transceiver board which is used in the Low Cost TDRSS Transceiver (LCT2) also developed by WFF. The LCT2's transceiver board demodulates commands received from the ground via the forward link and sends them to the CTP, where they are processed. The CTP inputs and processes data from the inertial measurement unit (IMU) and the GPS receiver board, generates status data, and then sends the data to the transceiver board where it is modulated and sent to the ground via the return link. Overall, the CTP has combined processing with the ability to interface to a GPS receiver, an IMU, and a pulse code modulation (PCM) communication link, while providing the capability to support common interfaces including Ethernet and serial interfaces on board a relatively small-sized, lightweight package.

  17. Multi-microprocessor that executes pure LISP in parallel

    SciTech Connect

    Guzman, A.

    1982-01-01

    The architecture presented allows parallel computation of high level languages, with some advantages: (1) the programmer is unaware that he is writing programs for a parallel computer; (2) the processors communicate little with each other, so that interconnection problems are minimised; (3) a given processor is unaware of how many other processors there are, or what they are doing; (4) a processor never waits for another process to have finished, nor does it awake or interrupt another processor. The machine processes in parallel programs written in high level languages capable of being expressed in the lambda notation (applicative languages). It is formed by a collection of general purpose processors which are weakly coupled and without hierarchy. Asynchronous computation is permitted by each processor evaluating a part of a program. 17 references.

  18. NWChem: scalable parallel computational chemistry

    SciTech Connect

    van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

    2011-11-01

    NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code specifically designed for large scale parallel applications. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. 
Within the context created by the features above, NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereupon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.

  19. Is Monte Carlo embarrassingly parallel?

    SciTech Connect

    Hoogenboom, J. E. [Delft Univ. of Technology, Mekelweg 15, 2629 JB Delft (Netherlands); Delft Nuclear Consultancy, IJsselzoom 2, 2902 LB Capelle aan den IJssel (Netherlands)

    2012-07-01

    Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turns out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)

  20. Parallel Modem Architectures for High-Data-Rate Space Modems

    NASA Astrophysics Data System (ADS)

    Satorius, E.

    2014-08-01

    Existing software-defined radios (SDRs) for space are limited in data volume by several factors, including bandwidth, space-qualified analog-to-digital converter (ADC) technology, and processor throughput, e.g., the throughput of a space-qualified field-programmable gate array (FPGA). In an attempt to further improve the throughput of space-based SDRs and to fully exploit the newer and more capable space-qualified technology (ADCs, FPGAs), we are evaluating parallel transmitter/receiver architectures for space SDRs. These architectures would improve data volume for both deep-space and particularly proximity (e.g., relay) links. In this article, designs for FPGA implementation of a high-rate parallel modem are presented as well as both fixed- and floating-point simulated performance results based on a functional design that is suitable for FPGA implementation.

  1. Design of a PSi header processor for the internet protocol 

    E-print Network

    Bai, Jinxia

    1995-01-01

    , the Full flag is set. Each of the CAM array words is provided with an Empty bit. In order to read a word from the CAM array, the address from where the word is to be read (the CP number in this application) is loaded into the Command Register... [Table and figure residue: command codes for upper-layer requests (DC, CR or CC, Send Data) and a block diagram linking the Upper Layer, Buffer, Header Processor, Main Memory, CPs, Output Processor, and Lower Layer over 32-bit address and data paths]...

  2. Kismet: parallel speedup estimates for serial programs

    Microsoft Academic Search

    Donghwan Jeon; Saturnino Garcia; Chris Louie; Michael Bedford Taylor

    2011-01-01

    Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs

  3. MAPS: multi-algorithm parallel circuit simulation

    Microsoft Academic Search

    Xiaoji Ye; Wei Dong; Peng Li; Sani R. Nassif

    2008-01-01

    The emergence of multi-core and many-core processors has introduced new opportunities and challenges to EDA research and development. While the availability of increasing parallel computing power holds new promise to address many computing challenges in CAD, the leverage of hardware parallelism can only be possible with a new generation of parallel CAD applications. In this paper, we propose a novel

  4. Synthetic aperture radar processing facility based on a parallel supercomputer

    Microsoft Academic Search

    S. Holm; A. Maoy

    1989-01-01

    A high-performance processing facility for synthetic aperture radar (SAR) is described. The SAR processor is designed for the ERS-1 remote sensing satellite and will process a 100-km by 100-km scene in less than eight minutes. This is three times the throughput of comparable facilities. The SAR processor is built around a 320-MFLOP parallel processor. The front-end processor is a superminicomputer,

  5. Universal schemes for parallel communication

    Microsoft Academic Search

    Leslie G. Valiant; Gordon J. Brebner

    1981-01-01

    In this paper we isolate a combinatorial problem that, we believe, lies at the heart of this question and provide some encouragingly positive solutions to it. We show that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency. The main innovation is an O(log

  6. Parallel symmetry-breaking in sparse graphs

    Microsoft Academic Search

    Andrew V. Goldberg; Serge A. Plotkin; Gregory E. Shannon

    1987-01-01

    We describe efficient deterministic techniques for breaking symmetry in parallel. The techniques work well on rooted trees and graphs of constant degree or genus. Our primary technique allows us to 3-color a rooted tree in O(lg* n) time on an EREW PRAM using a linear number of processors. We apply these techniques to construct fast linear processor algorithms for several problems,

  7. Parallel Computing, Failure Recovery, and Extreme Values

    Microsoft Academic Search

    Lars Nørvang Andersen; Søren Asmussen

    2008-01-01

    A task of random size T is split into M subtasks of lengths T1,…, TM, each of which is sent to one out of M parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If X1,…, XM are the total task times at the M
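
    The restart model in this abstract can be explored with a small simulation. The sketch below is illustrative only and rests on assumptions not stated in the abstract: the task is split into equal subtasks, failure times are exponentially distributed with a rate `failure_rate`, and the overall time is taken as the maximum of the per-processor totals. All function and parameter names are hypothetical.

```python
import random


def subtask_total_time(length, failure_rate, rng):
    """Time to finish one subtask when every failure forces a restart from zero."""
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= length:
            return elapsed + length       # this attempt ran to completion
        elapsed += time_to_failure        # work is lost; restart from the beginning


def overall_time(task_size, num_processors, failure_rate, rng):
    """Overall completion time: the slowest of the M processors."""
    length = task_size / num_processors   # assumption: equal split
    return max(subtask_total_time(length, failure_rate, rng)
               for _ in range(num_processors))


rng = random.Random(0)
samples = [overall_time(10.0, 4, 0.1, rng) for _ in range(1000)]
mean_time = sum(samples) / len(samples)
```

    Because the overall time is a maximum over processors, its tail is driven by the unluckiest processor, which is why extreme-value arguments are a natural tool here.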

  8. The Parallel Iterative Closest Point Algorithm

    Microsoft Academic Search

    Christian Langis; Michael A. Greenspan; Guy Godin

    2001-01-01

    This paper describes a parallel implementation developed to improve the time performance of the Iterative Closest Point Algorithm. Within each iteration, the correspondence calculations are distributed among the processor resources. At the end of each iteration, the results of the correspondence determination are communicated back to a central processor and the current transformation is calculated. A

  9. TILE64 - Processor: A 64-Core SoC with Mesh Interconnect

    Microsoft Academic Search

    S. Bell; B. Edwards; J. Amann; R. Conlin; K. Joyce; V. Leung; J. MacKay; M. Reif; Liewei Bao; J. Brown; M. Mattina; Chyi-Chang Miao; C. Ramey; D. Wentzlaff; W. Anderson; E. Berger; N. Fairbanks; D. Khan; F. Montenegro; J. Stickney; J. Zook

    2008-01-01

    The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and

  10. Fat-Btree: An Update-Conscious Parallel Directory Structure

    Microsoft Academic Search

    Haruo Yokota; Yasuhiko Kanemasa; Jun Miyazaki

    1999-01-01

    We propose a parallel directory structure, Fat-Btree, to improve high speed access for parallel database systems in shared nothing environments. The Fat-Btree has a threefold aim: to provide an indexing mechanism for fast retrieval in each processor; to balance the amount of data among distributed disks, and to reduce synchronization costs between processors during update operations. We use a probability

  11. Nested parallelism for multi-core HPC systems using Java

    Microsoft Academic Search

    Aamir Shafi; Bryan Carpenter; Mark Baker

    2009-01-01

    Since its introduction in 1993, the Message Passing Interface (MPI) has become a de facto standard for writing High Performance Computing (HPC) applications on clusters and Massively Parallel Processors (MPPs). The recent emergence of multi-core processor systems presents a new challenge for established parallel programming paradigms, including those based on MPI. This paper presents a new Java messaging system called

  12. Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC

    Microsoft Academic Search

    Sascha Uhrig

    2009-01-01

    Multicore processors get more and more popular, even in embedded systems. Unfortunately, these types of processors require a special kind of programming technique to offer their full performance, i.e. they require a high thread-level parallelism. In this paper we evaluate the performance of different configurations of the same processor core within an SoPC: a single threaded single core, a multithreaded

  13. Implementation of 128-Point Fast Fourier Transform Processor for UWB Systems

    Microsoft Academic Search

    Sang-In Cho; Kyu-Min Kang; Sang-Sung Choi

    2008-01-01

    In this paper, we present a 4-parallel fast Fourier transform (FFT) processor for a multi-band orthogonal frequency division multiplexing (MB-OFDM) ultra wideband (UWB) system. The proposed FFT processor utilizes a radix-2^4 structure so as to significantly lower the hardware complexity by reducing the numbers of multipliers and adders. The hardware-efficient 4-parallel 128-point FFT processor employing the decimation-in-frequency (DIF) and the

  14. A full fill-factor CCD imager with integrated signal processors

    Microsoft Academic Search

    Woodward Yang; Alice Chiang

    1990-01-01

    A (64*64) imager that combines simple charge-domain analog signal processors in a parallel, pipelined architecture to realize an integrated signal processor that performs a simple edge detection algorithm in real time is described. With a serial output clock rate of 10 MHz, the signal processor is capable of 1000-frames/s operation. This signal-processing capability is implemented with standard CMOS technology without

  15. Parallel VLSI Circuit Analysis and Optimization

    E-print Network

    Ye, Xiaoji

    2012-02-14

    The prevalence of multi-core processors in recent years has introduced new opportunities and challenges to Electronic Design Automation (EDA) research and development. In this dissertation, a few parallel Very Large Scale Integration (VLSI) circuit...

  16. Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures

    Microsoft Academic Search

    Jairo Panetta; P. R. P. de Souza Filho; C. A. da Cunha Filho; F. M. R. da Motta; S. S. Pinheiro; I. Pedrosa; A. L. R. Rosa; L. R. Monnerat; L. T. Carneiro; C. H. B. de Albrecht

    2007-01-01

    We describe the computational characteristics of the Kirchhoff prestack seismic migration currently used in daily production runs at Petrobras and its port to novel architectures. Fully developed in house, this portable and fault tolerant application has high sequential and parallel efficiency, with parallel scalability tested up to 8192 processors on the IBM Blue Gene without exhausting parallelism. Production load comprises

  17. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

    Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  18. Conversion via software of a simd processor into a mimd processor

    SciTech Connect

    Guzman, A.; Gerzso, M.; Norkin, K.B.; Vilenkin, S.Y.

    1983-01-01

    A method is described which takes a pure LISP program and automatically decomposes it via automatic parallelization into several parts, one for each processor of an SIMD architecture. Each of these parts is a different execution flow, i.e., a different program. The execution of these different programs by an SIMD architecture is examined. The method has been developed in some detail for the PS-2000, an SIMD Soviet multiprocessor, making it behave like AHR, a Mexican MIMD multi-microprocessor. Both the PS-2000 and AHR execute a pure LISP program in parallel; its decomposition into n pieces, their synchronization, scheduling, etc., are performed by the system (hardware and software). In order to achieve simultaneous execution of different programs in an SIMD processor, the method uses a scheme of node scheduling and node exportation. 14 references.

  19. Parallel Ab initio quantum chemistry on pentium-pro networks

    SciTech Connect

    Seidl, E.; Janssen, C.; Colvin, M. [Sandia National Lab., Albuquerque, NM (United States)]

    1997-12-31

    As the performance of inexpensive PCs approaches that of the fastest single processor supercomputers, high-end computing is increasingly dominated by massively parallel computers with hundreds or thousands of CPUs. Although such systems are essential for many applications, smaller parallel computers can achieve much lower price/performance ratios using commodity processors and interconnections. To investigate the feasibility of this approach for parallel quantum chemistry we have constructed a 56 processor parallel computer from fourteen 4-processor shared memory Pentium-Pro motherboards. These are interconnected by a 100 Mbit/sec. fast ethernet switch and each motherboard has 256 Mbytes of RAM and 1 Gbyte of disk. The system runs the LINUX operating system which supports symmetric multiprocessing on each four processor motherboard. Although some bottlenecks still exist in the inter-system communication, we have achieved very reasonable speedups running our massively parallel quantum chemistry program (MPQC).

  20. Fast Parallel Computation Of Multibody Dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Kwan, Gregory L.; Bagherzadeh, Nader

    1996-01-01

    Constraint-force algorithm fast, efficient, parallel-computation algorithm for solving forward dynamics problem of multibody system like robot arm or vehicle. Solves problem in minimum time proportional to log(N) by use of optimal number of processors proportional to N, where N is number of dynamical degrees of freedom: in this sense, constraint-force algorithm both time-optimal and processor-optimal parallel-processing algorithm.

  1. VLSI Processor For Vector Quantization

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul

    1995-01-01

    Pixel intensities in each kernel compared simultaneously with all code vectors. Prototype high-performance, low-power, very-large-scale integrated (VLSI) circuit designed to perform compression of image data by vector-quantization method. Contains relatively simple analog computational cells operating on direct or buffered outputs of photodetectors grouped into blocks in imaging array, yielding vector-quantization code word for each such block in sequence. Scheme exploits parallel-processing nature of vector-quantization architecture, with consequent increase in speed.

  2. DMA Performance Analysis and Multi-core Memory Optimization for SWIM Benchmark on the Cell Processor

    Microsoft Academic Search

    Yong Dou; Lin Deng; Jinhui Xu; Yi Zheng

    2008-01-01

    The Cell processor is a typical heterogeneous multi-core processor with powerful computing capability. But we face the challenges of the 'memory wall' in developing parallel applications, such as limited capacity of local memory, limited memory bandwidth for multi-cores and the long latency for data communication. The DMA transfer mechanism is often used to hide the long latency and improve

  3. COMPUTATION OF THE SINGULAR VALUE DECOMPOSITION USING MESH-CONNECTED PROCESSORS

    Microsoft Academic Search

    R. P. BRENT; F. T. LUK; CHARLES VAN LOAN

    1985-01-01

    A cyclic Jacobi method for computing the singular value decomposition of an m × n matrix (m ≥ n) using systolic arrays is proposed. The algorithm requires O(n^2) processors and O(m + n log n) units of time.
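
    For readers unfamiliar with Jacobi-type SVD methods, the following serial sketch shows the core step: sweep cyclically over column pairs and rotate each pair until the columns are mutually orthogonal. It uses the one-sided (Hestenes) variant in plain Python, swapped in here for illustration; it is not the systolic-array scheme of the paper, and the function name, `sweeps`, and `tol` are assumptions.

```python
import math


def jacobi_singular_values(a, sweeps=30, tol=1e-12):
    """Singular values of an m x n matrix (list of rows, m >= n), one-sided Jacobi."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]   # work column-wise
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):               # cyclic order over all column pairs
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                 # pair is already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):           # plane rotation of the two columns
                    x, y = cols[p][i], cols[q][i]
                    cols[p][i] = c * x - s * y
                    cols[q][i] = s * x + c * y
        if not rotated:
            break                            # a full sweep made no change
    # Once the columns are orthogonal, their norms are the singular values.
    return sorted((math.sqrt(sum(x * x for x in col)) for col in cols), reverse=True)


sigma = jacobi_singular_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

    Rotations on disjoint column pairs are independent of one another, which is the property that parallel and systolic orderings exploit.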

  4. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
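
    One common way to parallelize the Sieve is to share the small base primes and let each worker sieve its own range independently. The sketch below simulates that serially with contiguous blocks. This is an assumption made for illustration and is not necessarily the scattered decomposition studied in the paper; the names `base_primes`, `sieve_block`, and `block_sieve` are hypothetical.

```python
import math


def base_primes(limit):
    """Ordinary serial sieve for the primes up to `limit` (limit >= 1)."""
    flags = [False, False] + [True] * (limit - 1)
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            for multiple in range(p * p, limit + 1, p):
                flags[multiple] = False
    return [i for i, is_prime in enumerate(flags) if is_prime]


def sieve_block(low, high, primes):
    """Primes in [low, high], found using only the shared base primes."""
    flags = [True] * (high - low + 1)
    for p in primes:
        if p * p > high:
            break
        # First multiple of p inside the block, but never p itself.
        start = max(p * p, ((low + p - 1) // p) * p)
        for multiple in range(start, high + 1, p):
            flags[multiple - low] = False
    return [low + i for i, is_prime in enumerate(flags) if is_prime]


def block_sieve(n, num_workers):
    """All primes in [2, n]; each loop iteration stands in for one worker."""
    primes = base_primes(math.isqrt(n))
    block = -(-(n - 1) // num_workers)       # ceil((n - 1) / num_workers)
    found = []
    for w in range(num_workers):
        low = 2 + w * block
        high = min(n, low + block - 1)
        if low <= high:
            found.extend(sieve_block(low, high, primes))
    return found


primes_to_30 = block_sieve(30, 4)
```

    With contiguous blocks the workers owning small numbers see more strike-outs per prime than the rest, a load imbalance that other decompositions aim to even out.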

  5. DSP Processors Politecnico di Milano

    E-print Network

    Alippi, Cesare

    DSP Processors Politecnico di Milano DSP Processors Signal processing (audio, ... Optimized architectures for DSP algorithms Typical DSP Applications Digital samples from physical DSP Algorithms Features Iterative numeric computation on large data High fidelity numeric

  6. Processors for Mobile Applications

    Microsoft Academic Search

    Farinaz Koushanfar; Miodrag Potkonjak; Vandana Prabhu; Jan M. Rabaey

    2000-01-01

    Mobile processors form a large and very fast growing segment of the semiconductor market. Although they are used in a great variety of embedded systems such as personal digital organizers (PDAs), smart cards, internet appliances, laptops, smart badges, cellular phones, wearable computers, and sensor networks, they share the common need for low power, code density, security, cost sensitivity and multimedia and

  7. Ultra Dependable Processor

    NASA Astrophysics Data System (ADS)

    Sakai, Shuichi; Goshima, Masahiro; Irie, Hidetsugu

    This paper presents a processor architecture that provides a much higher level of dependability than current ones. Its features are: (1) fault tolerance and secure processing are integrated into a modern superscalar VLSI processor; (2) light-weight effective soft-error tolerant mechanisms are proposed and evaluated; (3) timing errors on random logic and registers are prevented by low-overhead mechanisms; (4) program behavior is hidden from the outer world by proposed address translation methods; (5) information leakage can be avoided by attaching policy tags for all data and monitoring them for each instruction execution; (6) injection attacks are avoided with much higher accuracy than the current systems, by providing tag tracking; (7) the overall structure of the dependable processor is proposed with a dependability manager which controls the detection of illegal conditions and recovers to the normal mode; and (8) an FPGA-based testbed system is developed where the system clock and the voltage are intentionally varied for experiment. The paper presents the fundamental scheme for the dependability, elemental technologies for dependability and the whole architecture of the ultra dependable processor. After showing them, the paper concludes with future work.

  8. Efficient Spare Allocation for Reconfigurable Arrays

    Microsoft Academic Search

    Sy-Yen Kuo; W. K. Fuchs

    1987-01-01

    Yield degradation from physical failures in large memories and processor arrays is of significant concern to semiconductor manufacturers. One method of increasing the yield for iterated arrays of memory cells or processing elements is to incorporate spare rows and columns in the die or wafer. These spare rows and columns can then be programmed into the array. The authors discuss

  9. An implementation of a parallel ray tracing algorithm on hybrid parallel architecture

    Microsoft Academic Search

    Chang-Geun Kwon; Hyo-Kyung Sung; Heung-Moon Choi

    1998-01-01

    We present a parallel ray tracing algorithm on hybrid parallel architecture with a processor farm model to speed up the ray tracing. The hybrid parallel architecture, a hybrid of a tightly- and a loosely-coupled one, is used in which reconfiguration for local and virtual shared memory is made through a crossbar network with a local and global bus. The proposed

  10. Optimised reconfigurable MAC processor architecture

    Microsoft Academic Search

    Marios Iliopoulos; Theodore Antonakopoulos

    2001-01-01

    Inefficient resource utilization is encountered in various embedded communication devices that are based on standard processor cores and custom hardware modules. This paper addresses the inefficient resource utilization problem in MAC processor designs and presents a solution that is based on a reconfigurable processor architecture and on dynamic-static instruction partitioning, depending on medium access protocol requirements. The presented instruction partitioning

  11. A parallel algorithm for channel routing on a hypercube

    NASA Technical Reports Server (NTRS)

    Brouwer, Randall; Banerjee, Prithviraj

    1987-01-01

    A new parallel simulated annealing algorithm for channel routing on a P-processor hypercube is presented. The basic idea used is to partition a set of tracks equally among processors in the hypercube. In parallel, P/2 pairs of processors perform displacements and exchanges of nets between tracks, compute the changes in cost functions, and accept moves using a parallel annealing criterion. Through the use of a unique distributed data structure, it is possible to minimize message traffic and add versatility and efficiency in a parallel routing tool. The algorithm has been implemented and is being tested on some of the popular channel problems from the literature.

  12. Software-based instruction caching for embedded processors

    Microsoft Academic Search

    Jason E. Miller; Anant Agarwal

    2006-01-01

    While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per

  13. Software-based instruction caching for embedded processors

    Microsoft Academic Search

    Jason E. Miller; Anant Agarwal

    2006-01-01

    While hardware instruction caches are present in virtually all general-purpose and high-performance microprocessors today, many embedded processors use SRAM or scratchpad memories instead. These are simple array memory structures that are directly addressed and explicitly managed by software. Compared to hardware caches of the same data capacity, they are smaller, have shorter access times and consume less energy per access.

  14. Implementing clips on a parallel computer

    NASA Technical Reports Server (NTRS)

    Riley, Gary

    1987-01-01

    The C language integrated production system (CLIPS) is a forward chaining rule based language to provide training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.

  15. The Dynamic Adaptation of Parallel Mesh-Based Computation

    E-print Network

    Savage, John

    The Dynamic Adaptation of Parallel Mesh-Based Computation José G. Castaños John E. Savage adaptation of unstructured FE meshes on loosely coupled parallel processors. We describe (a) a parallel adaptation algorithm, (b) an online parallel repartitioning algorithm based on mesh adaptation histories, (c)

  16. An extensible infrastructure for benchmarking Multi-core Processors based systems

    Microsoft Academic Search

    M. Hasan Jamal; Ghulam Mustafa; Abdul Waheed; Waqar Mahmood

    2009-01-01

    With wide adoption of multi-core processor based systems, there is a need for benchmarking such systems at both application and operating system levels. Developing benchmarks for multi-core systems is a cumbersome task due to underlying parallel architecture and complexity of parallel programming paradigms. In this paper, we introduce multi-core processor architecture and communication (MPAC) benchmarking library, which provides a common

  17. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

  18. Dynamic Load Distribution in the Borealis Stream Processor

    Microsoft Academic Search

    Ying Xing; Jeong-hyon Hwang

    2005-01-01

    Distributed and parallel computing environments are becoming cheap and commonplace. The availability of large numbers of CPUs makes it possible to process more data at higher speeds. Stream-processing systems are also becoming more important, as broad classes of applications require results in real-time. Since load can vary in unpredictable ways, exploiting the abundant processor cycles requires effective

  19. An Acceleration Processor for Data Intensive Scientific Computing

    Microsoft Academic Search

    Cheong Ghil KIM; Hong-Sik KIM; Sungho KANG; Shin Dug KIM; Gunhee HAN

    2004-01-01

    Scientific computations for diffusion equations and ANNs (Artificial Neural Networks) are data intensive tasks accompanied by heavy memory access; on the other hand, their computational complexities are relatively low. Thus, this type of task naturally maps onto SIMD (Single Instruction Multiple Data stream) parallel processing with distributed memory. This paper proposes a high performance acceleration processor of which

  20. Automatic Architectural Synthesis of VLIW and EPIC Processors

    Microsoft Academic Search

    Shail Aditya; B. Ramakrishna Rau; Vinod Kathail

    1999-01-01

    This paper describes a mechanism for automatic design and synthesis of very long instruction word (VLIW), and its generalization, explicitly parallel instruction computing (EPIC) processor architectures starting from an abstract specification of their desired functionality. The process of architecture design makes concrete decisions regarding the number and types of functional units, number of read/write ports on register files, the data-path