Science.gov

Sample records for parallel processor array

  1. Titanic: a VLSI based content addressable parallel array processor

    SciTech Connect

    Weems, C.; Levitan, S.; Foster, C.

    1982-01-01

    A design is presented for a content addressable parallel array processor (CAPAP) which is both practical and feasible. Its practicality stems from an extensive program of research into real applications of content addressability and parallelism. The feasibility of the design stems from development under a set of conservative engineering constraints tied to limitations of VLSI technology. 1 ref.

  2. Digital Parallel Processor Array for Optimum Path Planning

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

    1996-01-01

    The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of the direction along which the first stimulus signal is received, and broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.
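
    As a rough illustration of the propagate-then-trace-back scheme (a software sketch with invented names, not the patented circuit), the wavefront reduces to a breadth-first search when all cell delays are equal; the actual array encodes terrain cost in each cell's preset delay:

        # Sketch of the stimulus-wavefront / backtrace idea in Python.
        from collections import deque

        def plan_path(blocked, start, goal):
            """Propagate a 'stimulus' from start; each cell stores the direction
            its first stimulus arrived from; a master step then traces back from
            goal. With uniform delays this is breadth-first search."""
            rows, cols = len(blocked), len(blocked[0])
            came_from = [[None] * cols for _ in range(rows)]
            came_from[start[0]][start[1]] = start
            frontier = deque([start])
            while frontier:
                r, c = frontier.popleft()
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # N, S, W, E links
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not blocked[nr][nc] and came_from[nr][nc] is None):
                        came_from[nr][nc] = (r, c)   # remember arrival direction
                        frontier.append((nr, nc))
            if came_from[goal[0]][goal[1]] is None:
                return None                          # goal unreachable
            path, cell = [], goal                    # master-processor traceback
            while cell != start:
                path.append(cell)
                cell = came_from[cell[0]][cell[1]]
            return [start] + path[::-1]

        terrain = [[0, 0, 0, 0],
                   [1, 1, 0, 1],                     # 1 = impassable cell
                   [0, 0, 0, 0]]
        print(plan_path(terrain, (0, 0), (2, 0)))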

  3. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  4. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.
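
    A hedged software sketch of the bit-serial mode of operation described above: every processing element executes the same full-adder step on one bit slice of its operands per cycle (data and names are illustrative, not the MPP's actual instruction set):

        # Bit-serial addition across an array of processing elements: the
        # outer loop is the single instruction stream; the inner loop stands
        # in for thousands of PEs operating simultaneously in hardware.
        def bit_serial_add(a, b, width=8):
            n = len(a)
            carry = [0] * n                     # one carry flip-flop per PE
            result = [0] * n
            for bit in range(width):            # one bit slice per step
                for pe in range(n):             # fully parallel in hardware
                    abit = (a[pe] >> bit) & 1
                    bbit = (b[pe] >> bit) & 1
                    result[pe] |= (abit ^ bbit ^ carry[pe]) << bit    # sum bit
                    carry[pe] = (abit & bbit) | (carry[pe] & (abit ^ bbit))
            return result

        print(bit_serial_add([3, 100, 250], [5, 27, 4]))    # -> [8, 127, 254]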

  5. Parallel Parsing of Context-Free Languages on an Array of Processors 

    E-print Network

    Langlois, Laurent Chevalier.

    1988-01-01

    Kosaraju [Kosaraju 69], and independently ten years later Guibas, Kung and Thompson [Guibas 79], devised an algorithm (K-GKT) for solving on an array of processors a class of dynamic programming problems of which general ...

  6. Array processor architecture connection network

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1982-01-01

    A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.
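
    As an illustration of the Omega topology named above (a generic routing model, not the patented strobe and connection-control logic), each message can find its own path with the standard destination-tag rule:

        # Destination-tag routing in an Omega network of 2x2 switches: lines
        # are perfect-shuffled between stages, and at stage s the s-th most
        # significant bit of the destination selects the switch output port.
        def omega_route(src, dst, n_bits):
            """Trace a message from processor src to memory module dst through
            log2(N) switching stages; returns the line held after each stage."""
            mask = (1 << n_bits) - 1
            line, trace = src, []
            for stage in range(n_bits):
                line = ((line << 1) | (line >> (n_bits - 1))) & mask   # shuffle
                port = (dst >> (n_bits - 1 - stage)) & 1               # tag bit
                line = (line & ~1) | port                              # exit port
                trace.append(line)
            assert line == dst                  # routing always lands on dst
            return trace

        print(omega_route(src=2, dst=5, n_bits=3))   # [5, 2, 5] for an N=8 network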

  7. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system comprises a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  8. VLSI array processor

    NASA Astrophysics Data System (ADS)

    Greenwood, E.

    1982-07-01

    The Arithmetic Processor Unit (APU) data base design check was completed. Minor design rule violations and design improvements were accomplished. The APU mask set has been fabricated and checked. Initial checking of all mask layers revealed a design rule problem in one layer. That layer was corrected, refabricated and checked out. The mask set has been delivered to the chip fabrication area. The fabrication process has been initiated. All work on the Array Processor Demonstration System (APDS) has been suspended at CHI until the additionally requested funding was received. That funding has been authorized and CHI will begin work on the APDS in July. The following activities are planned in the following quarter: 1) Complete fabrication of the first lot of VLSI APU devices. 2) Complete integration and check-out of the APDS simulator. 3) Complete integration and check-out of the APU breadboard. 4) Verify the VLSI APU wafer tests with the APU breadboard. 5) Complete check-out of the APDS using the APU breadboard.

  9. Optical systolic array processor using residue arithmetic

    NASA Technical Reports Server (NTRS)

    Jackson, J.; Casasent, D.

    1983-01-01

    The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
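
    A small Python sketch of the residue idea: the matrix-vector product is computed independently in several low-dynamic-range channels, one per modulus, and the full-range result is recovered afterwards by the Chinese Remainder Theorem (moduli and sizes here are illustrative, not the paper's optical implementation):

        from functools import reduce

        MODULI = (5, 7, 9, 11, 13, 16)        # pairwise coprime, product 720720

        def matvec_mod(M, v, m):
            # one low-precision channel: all arithmetic stays below modulus m
            return [sum(a * b for a, b in zip(row, v)) % m for row in M]

        def crt(residues, moduli):
            # recombine channel results into the conventional representation
            prod = reduce(lambda x, y: x * y, moduli)
            total = 0
            for r, m in zip(residues, moduli):
                p = prod // m
                total += r * p * pow(p, -1, m)    # modular inverse (Python 3.8+)
            return total % prod

        M = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
        v = [7, 8, 9]
        channels = [matvec_mod(M, v, m) for m in MODULI]   # independent channels
        y = [crt([ch[i] for ch in channels], MODULI) for i in range(len(M))]
        print(y)    # [65, 128, 107], the ordinary matrix-vector product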

  10. Parallel processor engine model program

    NASA Technical Reports Server (NTRS)

    Mclaughlin, P.

    1984-01-01

    The Parallel Processor Engine Model Program is a generalized engineering tool intended to aid in the design of parallel processing real-time simulations of turbofan engines. It is written in the FORTRAN programming language and executes as a subset of the SOAPP simulation system. Input/output and execution control are provided by SOAPP; however, the analysis, emulation and simulation functions are completely self-contained. A framework in which a wide variety of parallel processing architectures could be evaluated and tools with which the parallel implementation of a real-time simulation technique could be assessed are provided.

  11. Rectangular Array Of Digital Processors For Planning Paths

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

    1993-01-01

    Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. That processor transmits signals to nearest-neighbor processors, which retransmit to other neighboring processors, and process repeats until signals propagate across entire field.

  12. Parallel Analog-to-Digital Image Processor

    NASA Technical Reports Server (NTRS)

    Lokerson, D. C.

    1987-01-01

    Proposed integrated-circuit network of many identical units converts analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output is digital, it lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

  13. Design of a massively parallel processor

    NASA Technical Reports Server (NTRS)

    Batcher, K. E.

    1980-01-01

    The massively parallel processor (MPP) system is designed to process satellite imagery at high rates. A large number (16,384) of processing elements (PE's) are configured in a square array. For optimum performance on operands of arbitrary length, processing is performed in a bit-serial manner. On 8-bit integer data, addition can occur at 6553 million operations per second (MOPS) and multiplication at 1861 MOPS. On 32-bit floating-point data, addition can occur at 430 MOPS and multiplication at 216 MOPS.

  14. ALSEP Array E Parts Application Analysis: Data Processor

    E-print Network

    Rathbun, Julie A.

    Parts application analysis for the ALSEP Array E Central Station Digital Data Processor (ATM 955, dated 6-1-71). The Data Processor is completely redundant, with the redundant circuits kept in standby, unenergized.

  15. The Use of a Microcomputer Based Array Processor for Real Time Laser Velocimeter Data Processing

    NASA Technical Reports Server (NTRS)

    Meyers, James F.

    1990-01-01

    The application of an array processor to laser velocimeter data processing is presented. The hardware is described along with the method of parallel programming required by the array processor. A portion of the data processing program is described in detail. The increase in computational speed of a microcomputer equipped with an array processor is illustrated by comparative testing with a minicomputer.

  16. Adaptively Parallel Processor Allocation for Cilk Jobs

    E-print Network

    Sen, Siddhartha

    The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known ...

  17. Thread Scheduling Mechanisms for Multiple-Context Parallel Processors

    E-print Network

    Fiske, James A. Stuart

    1995-06-01

    Scheduling tasks to efficiently use the available processor resources is crucial to minimizing the runtime of applications on shared-memory parallel processors. One factor that contributes to poor processor utilization ...

  18. Parallel processor programs in the Federal Government

    NASA Technical Reports Server (NTRS)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  19. Assignment Of Finite Elements To Parallel Processors

    NASA Technical Reports Server (NTRS)

    Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

    1990-01-01

    Elements assigned approximately optimally to subdomains. Mapping algorithm based on simulated-annealing concept used to minimize approximate time required to perform finite-element computation on hypercube computer or other network of parallel data processors. Mapping algorithm needed when shape of domain complicated or otherwise not obvious what allocation of elements to subdomains minimizes cost of computation.

  20. Binocular Disparity Calculation on a Massively-Parallel Analog Vision Processor

    E-print Network

    Mandal, Soumyajit

    We studied neuromorphic models of binocular disparity processing and mapped them onto a vision chip containing a massively parallel analog processor array. Our goal was to make efficient use of the available hardware while ...

  1. Fault-tolerant array processors using single-track switches

    SciTech Connect

    Kung, S.Y.; Jean, S.N.; Chang, C.W.

    1989-04-01

    An array processor is a collection of many similar processing elements (PE's), which can operate in both parallel and pipelined processing modes. For the implementation of arrays with a large number of processors, fault tolerance has always been a very critical design issue. Very often, spare PE's and switching lattices are incorporated in the array to improve the (fabrication-time) yield and the (run-time) reliability. In this paper, an array grid model based on single-track switches is proposed. A reconfigurability theorem is developed to provide the theoretical footing for new reconfiguration algorithms for fabrication-time and run-time processing. For fabrication-time yield enhancement, the problem of finding a feasible reconfiguration using global control can be reformulated as a maximum independent set problem. An existing algorithm in graph theory is adopted to solve this problem.

  2. Phased array antenna beamforming using optical processor

    NASA Technical Reports Server (NTRS)

    Anderson, L. P.; Boldissar, F.; Chang, D. C. D.

    1991-01-01

    The feasibility of optical processor based beamforming for microwave array antennas is investigated. The primary focus is on systems utilizing the 20/30 GHz communications band and a transmit configuration exclusively to serve this band. A mathematical model is developed for computation of candidate design configurations. The model is capable of determination of the necessary design parameters required for spatial aspects of the microwave 'footprint' (beam) formation. Computed example beams transmitted from geosynchronous orbit are presented to demonstrate network capabilities. The effect of the processor on the output microwave signal to noise quality at the antenna interface is also considered.

  3. Intermediate-level computer-vision-processing algorithm development for the content-addressable-array parallel processor. Quarterly status report No. 3 for period ending 29 November 1986

    SciTech Connect

    Not Available

    1986-12-15

    During this quarter a set of seven benchmark problems was developed and analyzed for the IUA. These included the Hough transform, convex hull, Voronoi diagram, minimal spanning tree, visibility of vertices in a projected 3-dimensional model, subgraph isomorphism, and the minimum-cost path between points in a weighted graph. These problems are commonly considered intermediate-level processing in many vision research groups. Parallel implementations of UMass intermediate-level processing algorithms, such as Boldt's line merging and Anandan's motion analysis, continued to develop. A commercial processor, the TMS320C25, was chosen as the Intermediate Communications and Associative Processor (ICAP) processing element. The TMS320C25 has the advantages that it is a five-million-instruction-per-second signal-processing unit with a fast multiplier and software support for fast floating-point operations. It also has a built-in 5 Mb/s serial port that will interface well with the intermediate-level communications network. Also being explored is a set of group-theoretic network topologies with respect to the communication needs of intermediate-level processing. This has required the analysis of the classes of communication needed in each of the algorithms implemented.

  4. Associative massively parallel processor for video processing

    NASA Astrophysics Data System (ADS)

    Krikelis, Argy; Tawiah, T.

    1996-03-01

    Massively parallel processing architectures have matured primarily through image processing and computer vision applications. The similarity of processing requirements between these areas and video processing suggests that they should be very appropriate for video processing applications. This research describes the use of an associative massively parallel processing based system for video compression, including an architectural and system description, discussion of the implementation of compression tasks such as DCT/IDCT, motion estimation, and quantization, and system evaluation. The core of the processing system is the ASP (Associative String Processor) architecture, a modular, massively parallel, programmable, and inherently fault-tolerant fine-grain SIMD processing architecture incorporating a string of identical APEs (Associative Processing Elements), a reconfigurable inter-processor communication network, and a Vector Data Buffer for fully overlapped data input-output. For video compression applications a prototype system is developed which uses ASP modules to implement the required compression tasks. This scheme leads to a linear speed-up of the computation by simply adding more APEs to the modules.

  5. Scalable Unix tools on parallel processors

    SciTech Connect

    Gropp, W.; Lusk, E.

    1994-12-31

    The introduction of parallel processors that run a separate copy of Unix on each processor has introduced new problems in managing the user's environment. This paper discusses some generalizations of common Unix commands for managing files (e.g., ls) and processes (e.g., ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.
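
    The underlying pattern can be sketched as follows (a local stand-in: subprocess calls replace the remote shells, and the host names are invented): fan the command out to all nodes concurrently and merge the tagged output:

        # Toy skeleton of a scalable parallel `ps`-style tool.
        import concurrent.futures, subprocess

        def run_on(host, cmd):
            # A real tool would dispatch cmd to the named node; here we run it
            # locally and tag the output with the host name for illustration.
            out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            return host, out.stdout.strip()

        hosts = ["node%02d" % i for i in range(8)]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(hosts)) as pool:
            for host, text in pool.map(lambda h: run_on(h, "uname -sr"), hosts):
                print(host, text)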

  6. Acceleration of computer-generated hologram by Greatly Reduced Array of Processor Element with Data Reduction

    NASA Astrophysics Data System (ADS)

    Sugiyama, Atsushi; Masuda, Nobuyuki; Oikawa, Minoru; Okada, Naohisa; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2014-11-01

    We have implemented a computer-generated hologram (CGH) calculation on Greatly Reduced Array of Processor Element with Data Reduction (GRAPE-DR) processors. The cost of CGH calculation is enormous, but CGH calculation is well suited to parallel computation. The GRAPE-DR is a multicore processor that has 512 processor elements. The GRAPE-DR supports a double-precision floating-point operation and can perform CGH calculation with high accuracy. The calculation speed of the GRAPE-DR system is seven times faster than that of a personal computer with an Intel Core i7-950 processor.

  7. Fast Hough Transform On A Mesh Connected Processor Array

    NASA Astrophysics Data System (ADS)

    Kannan, C. S.; Chuang, Henry Y. H.

    1988-02-01

    Hough transform is an effective method for the detection of the shape of object boundaries in image pattern analysis. Since the Hough transform is very computation intensive, it is essential to parallelize the computation. However, an effective parallel algorithm is harder to obtain because it requires global information. In this paper we present an efficient parallel Hough transform algorithm for the detection of straight lines using mesh connected processor arrays. While other parallel algorithms take either O(n^2) or O(N) time, where n is the number of distinct values of a parameter and N is the number of edge pixels, our algorithm takes O(n) time.
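
    For reference, the computation being parallelized can be stated in a few lines: each edge pixel votes for every (theta, rho) parameter cell consistent with it, and peaks in the accumulator mark straight lines. This is a plain serial sketch, not the paper's O(n) mesh algorithm:

        import math

        def hough_lines(edge_pixels, n_theta=180, rho_max=128):
            # accumulator over the (theta, rho) parameter space
            acc = [[0] * (2 * rho_max) for _ in range(n_theta)]
            for x, y in edge_pixels:
                for t in range(n_theta):
                    theta = math.pi * t / n_theta
                    rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
                    acc[t][rho + rho_max] += 1       # one vote per (theta, rho)
            return acc

        points = [(i, 2 * i + 1) for i in range(20)]       # collinear: y = 2x + 1
        acc = hough_lines(points)
        t, r = max(((t, r) for t in range(180) for r in range(256)),
                   key=lambda tr: acc[tr[0]][tr[1]])
        print("peak near theta index", t, "rho", r - 128)  # ~20 votes on one cell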

  8. Contextual classification on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Tilton, James C.

    1987-01-01

    Classifiers are often used to produce land cover maps from multispectral Earth observation imagery. Conventionally, these classifiers have been designed to exploit the spectral information contained in the imagery. Very few classifiers exploit the spatial information content of the imagery, and the few that do rarely exploit spatial information content in conjunction with spectral and/or temporal information. A contextual classifier that exploits spatial and spectral information in combination through a general statistical approach was studied. Early test results obtained from an implementation of the classifier on a VAX-11/780 minicomputer were encouraging, but they are of limited meaning because they were produced from small data sets. An implementation of the contextual classifier is presented on the Massively Parallel Processor (MPP) at Goddard that for the first time makes feasible the testing of the classifier on large data sets.

  9. Breadboard Signal Processor for Arraying DSN Antennas

    NASA Technical Reports Server (NTRS)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael; Rayhrer, Benno

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a subsequent version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain. In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.
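
    A minimal numpy sketch of the FX idea under simplifying assumptions (no windowing, fringe rotation, or delay compensation): Fourier transform each antenna stream, then cross-multiply and accumulate the spectra for every antenna pair; the phase slope of a cross-spectrum across frequency encodes the inter-antenna delay used for calibration:

        import numpy as np

        def fx_correlate(streams, fft_len=256):
            """streams: (n_antennas, n_samples). Returns accumulated cross-
            spectra of shape (n_antennas, n_antennas, fft_len)."""
            n_ant, n_samp = streams.shape
            acc = np.zeros((n_ant, n_ant, fft_len), dtype=complex)
            for s in range(0, n_samp - fft_len + 1, fft_len):
                spec = np.fft.fft(streams[:, s:s + fft_len], axis=1)       # F
                acc += np.einsum('if,jf->ijf', spec, np.conj(spec))        # X
            return acc

        rng = np.random.default_rng(0)
        sky = rng.standard_normal(4096)                    # common noise signal
        streams = np.stack([sky, np.roll(sky, 3)])         # antenna 1 lags 3 samples
        xc = fx_correlate(streams)
        step = np.angle(xc[0, 1, 1:32] / xc[0, 1, :31]).mean()   # phase per channel
        print("estimated delay ~", step / (2 * np.pi / 256), "samples")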

  10. Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor

    E-print Network

    Scott, Michael L.

    Researchers at the University of Rochester have used a collection of BBN Butterfly (TM) Parallel Processors to conduct research in parallel computing. In work with the Butterfly we have ported three compilers, developed five major and several minor library packages, built two ...

  11. Massively Parallel MRI Detector Arrays

    PubMed Central

    Keil, Boris; Wald, Lawrence L

    2013-01-01

    Originally proposed as a method to increase sensitivity by extending the locally high sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts, relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called "ultimate" SNR and g-factor. We also review the methods for optimally combining array data and the changes in RF methodology needed to construct massively parallel MRI detector arrays, and show some examples of the state of the art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

  12. Chemical network problems solved on NASA/Goddard's massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Cho, Seog Y.; Carmichael, Gregory R.

    1987-01-01

    The single-instruction-stream, multiple-data-stream Massively Parallel Processor (MPP) unit consists of 16,384 bit-serial arithmetic processors configured as a 128 x 128 array whose speed can exceed that of current supercomputers (Cyber 205). The applicability of the MPP for solving reaction network problems is presented and discussed, including the mapping of the calculation to the architecture, and CPU timing comparisons.

  13. Scan line graphics generation on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1988-01-01

    Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

  14. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  15. Experience with a multiprocessor based on eight FPS 120B array processors

    SciTech Connect

    Bucher, I.Y.; Frederickson, P.O.; Moore, J.W.

    1981-01-01

    The rate of increase in the speed of monoprocessors is no longer keeping pace with the needs of the laboratory; accordingly, the use of parallel processors in large scientific computations is being investigated. As an initial experiment, a particle-in-cell plasma simulation was adapted to run on a star graph architecture consisting of a UNIVAC 1110 as hub, and up to eight Floating Point Systems AP120B array processors at the other vertices. Subdivision of tasks among processors and measured results are discussed.

  16. Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors

    E-print Network

    Kaltofen, Erich

    Processor-Efficient Parallel Computation of Polynomial Greatest Common Divisors. Erich Kaltofen (kaltofen@cs.rpi.edu). Preliminary Report (July 1, 1989). 1. Introduction. We present a parallel algebraic PRAM algorithm that can ... on an algebraic circuit of size O(n^(ω+1) log(n)) and depth O(log(n)^2). This more general ...

  17. Past and Future Parallelism Challenges to Encompass Sequential Processor Evolution

    E-print Network

    Vialle, Stéphane

    This paper deals with parallelism evolution compared to sequential processor evolution following Moore's law: Moore's law, temporal advance of parallelism, speed-up models, portability, parallelism challenges. Computer performance doubles every 1.5 years, according to Moore's law [10]. So, most people prefer ...

  18. Design and optimization of a defect tolerant processor array 

    E-print Network

    Lakkapragada, Bhavani S

    1995-01-01

    In this thesis, the design and optimization of a defect-tolerant MIMD processor array, for maximum performance per wafer area, targeted at applications that have a large number of operations per memory word, is described. ...

  19. Singular value decomposition utilizing parallel algorithms on graphical processors

    SciTech Connect

    Kotas, Charlotte W; Barhen, Jacob

    2011-01-01

    One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_{k=1}^{K} X(k)X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining V, Σ, and U such that A = UΣV^H, where U and V are orthonormal and Σ is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^H A, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step algorithm which bidiagonalizes the matrix using Householder transformations, and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to that implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel and Distributed Processing Symposium 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to UΣ). V is obtained from the sequence of orthogonalizations, while Σ can be found from the square roots of the diagonal elements of A^H A and, once Σ is known, U can be found by column-scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's CUBLAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and number of concurrent SVDs to be calculated.
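
    A minimal sketch of the one-sided Jacobi scheme described above, written in plain numpy for real matrices rather than CUDA Fortran: plane rotations orthogonalize column pairs, replacing A by AV; the column norms of the result give Σ and column scaling gives U. The independent column-pair sweeps are what map naturally onto a parallel processor:

        import numpy as np

        def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
            A = A.astype(float).copy()
            n = A.shape[1]
            V = np.eye(n)
            for _ in range(max_sweeps):
                off = 0.0
                for p in range(n - 1):
                    for q in range(p + 1, n):
                        apq = A[:, p] @ A[:, q]
                        app, aqq = A[:, p] @ A[:, p], A[:, q] @ A[:, q]
                        off = max(off, abs(apq))
                        if abs(apq) < tol:
                            continue
                        # rotation angle that zeroes the (p, q) inner product
                        th = 0.5 * np.arctan2(2 * apq, app - aqq)
                        c, s = np.cos(th), np.sin(th)
                        rot = np.array([[c, -s], [s, c]])
                        A[:, [p, q]] = A[:, [p, q]] @ rot
                        V[:, [p, q]] = V[:, [p, q]] @ rot
                if off < tol:
                    break
            sigma = np.linalg.norm(A, axis=0)    # column norms = singular values
            U = A / sigma                        # column scaling (sigma unsorted)
            return U, sigma, V

        M = np.array([[4.0, 0.0], [3.0, -5.0]])
        U, s, V = one_sided_jacobi_svd(M)
        print(np.allclose((U * s) @ V.T, M))     # True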

  20. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J. (Seattle, WA)

    1990-01-01

    An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

  1. Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic Path Problem

    NASA Astrophysics Data System (ADS)

    Sedukhin, Stanislav G.; Miyazaki, Toshiaki; Kuroda, Kenichi

    The algebraic path problem (APP) is a general framework which unifies several solution procedures for a number of well-known matrix and graph problems. In this paper, we present a new 3-dimensional (3-D) orbital algebraic path algorithm and corresponding 2-D toroidal array processors which solve the n × n APP in the theoretically minimal number of 3n time-steps. The coordinated time-space scheduling of the computing and data movement in this 3-D algorithm is based on the modular function which preserves the main technological advantages of systolic processing: simplicity, regularity, locality of communications, pipelining, etc. Our design of the 2-D systolic array processors is based on a classical 3-D→2-D space transformation. We have also shown how a data manipulation (copying and alignment) can be effectively implemented in these array processors in a massively-parallel fashion by using a matrix-matrix multiply-add operation.
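
    The recurrence the APP framework unifies can be sketched with the semiring left as a parameter: plugging in (min, +) yields all-pairs shortest paths, while (or, and) yields transitive closure. This shows the algebra only, not the paper's 3-D orbital schedule:

        # Floyd-Warshall-style APP solver over an arbitrary closed semiring.
        def algebraic_path(M, plus, times, star):
            n = len(M)
            A = [row[:] for row in M]
            for k in range(n):
                dkk = star(A[k][k])              # closure of the pivot element
                for i in range(n):
                    for j in range(n):
                        A[i][j] = plus(A[i][j], times(times(A[i][k], dkk), A[k][j]))
            return A

        INF = float('inf')
        W = [[0, 3, INF],
             [INF, 0, 1],
             [2, INF, 0]]
        # (min, +): shortest paths; a* = 0 when all cycles are non-negative
        print(algebraic_path(W, min, lambda a, b: a + b, lambda a: 0))
        # (or, and): transitive closure over booleans; a* = True
        B = [[w < INF for w in row] for row in W]
        print(algebraic_path(B, lambda a, b: a or b,
                             lambda a, b: a and b, lambda a: True))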

  2. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    DOEpatents

    Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
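
    One plausible reading of the skew, sketched under assumptions (the patent's exact mapping may differ): start from an ordinary block-cyclic distribution over a P x Q processor grid and shift the column coordinate by the row coordinate, so that a block column no longer lands on a single column of the grid:

        def skewed_block_cyclic(i, j, block, P, Q):
            """Map array element (i, j) to a processor (p, q) in a P x Q grid."""
            bi, bj = i // block, j // block      # block coordinates
            p = bi % P                           # ordinary cyclic row mapping
            q = (bj + bi) % Q                    # the skew: columns shifted by row
            return p, q

        P, Q, block = 3, 3, 2
        # Without the skew (q = bj % Q), block column 0 would always land on
        # grid column 0; with it, both rows and columns cycle through the grid.
        print([skewed_block_cyclic(0, j, block, P, Q) for j in range(8)])  # a row
        print([skewed_block_cyclic(i, 0, block, P, Q) for i in range(8)])  # a column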

  3. Potential of minicomputer/array-processor system for nonlinear finite-element analysis

    NASA Technical Reports Server (NTRS)

    Strohkorb, G. A.; Noor, A. K.

    1983-01-01

    The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.

  4. Global synchronization of parallel processors using clock pulse width modulation

    DOEpatents

    Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

    2013-04-02

    A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and performs a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.

  5. Mesh-connected processor arrays for the transitive closure problem

    NASA Technical Reports Server (NTRS)

    Rao, S. K.; Citron, T.; Kailath, T.

    1985-01-01

    The main purpose in this paper is to lay a theoretical foundation for the design of mesh-connected processor arrays for the transitive closure problem. Using a simple path-algebraic formulation of the problem and observing its similarity to certain well-known smoothing problems that occur in digital signal processing, it is shown how to draw upon existing techniques from the signal processing literature to derive regular iterative algorithms for determining the transitive closure of the graph. The regular iterative algorithms that are derived using these considerations, are then analyzed and synthesized on mesh-connected processor arrays. Among the vast number of mesh-connected processor arrays that can be designed using this unified approach, the systolic arrays reported in the literature for this problem are shown to be special cases.

  6. Adaptive domain decomposition for Monte Carlo simulations on parallel processors

    NASA Technical Reports Server (NTRS)

    Wilmoth, Richard G.

    1990-01-01

    A method is described for performing direct simulation Monte Carlo (DSMC) calculations on parallel processors using adaptive domain decomposition to distribute the computational work load. The method has been implemented on a commercially available hypercube and benchmark results are presented which show the performance of the method relative to current supercomputers. The problems studied were simulations of equilibrium conditions in a closed, stationary box, a two-dimensional vortex flow, and the hypersonic, rarefied flow in a two-dimensional channel. For these problems, the parallel DSMC method ran 5 to 13 times faster than on a single processor of a Cray-2. The adaptive decomposition method worked well in uniformly distributing the computational work over an arbitrary number of processors and reduced the average computational time by over a factor of two in certain cases.

  7. Adaptive domain decomposition for Monte Carlo simulations on parallel processors

    NASA Technical Reports Server (NTRS)

    Wilmoth, Richard G.

    1991-01-01

    A method is described for performing direct simulation Monte Carlo (DSMC) calculations on parallel processors using adaptive domain decomposition to distribute the computational work load. The method has been implemented on a commercially available hypercube and benchmark results are presented which show the performance of the method relative to current supercomputers. The problems studied were simulations of equilibrium conditions in a closed, stationary box, a two-dimensional vortex flow, and the hypersonic, rarefied flow in a two-dimensional channel. For these problems, the parallel DSMC method ran 5 to 13 times faster than on a single processor of a Cray-2. The adaptive decomposition method worked well in uniformly distributing the computational work over an arbitrary number of processors and reduced the average computational time by over a factor of two in certain cases.

  8. Dynamic overset grid communication on distributed memory parallel processors

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  9. Real-time trajectory optimization on parallel processors

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.

    1993-01-01

    A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.

  10. A taxonomy of reconfiguration techniques for fault-tolerant processor arrays--

    SciTech Connect

    Chean, M. ); Fortes, J.A.B. )

    1990-01-01

    The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.

  11. Mapping Radiosity Computations to Parallel Processors.

    NASA Astrophysics Data System (ADS)

    Singh, Gautam Bir

    The radiosity method for rendering scenes is gaining popularity because of its ability to accurately model the energy distribution in an environment. As this photonic energy distribution is independent of the viewer's position, generating scenes for different viewpoints only requires hidden surface removal and can be performed in real-time. This makes it more attractive than ray tracing as a technique for modeling illumination. It is quite conceivable that the radiosity method will be used for applications in scientific visualization, lighting simulations, CAD/CAM, virtual reality, and medical imaging. Computing the radiosity of a scene with moderate to high complexity is tantamount to solving a system of tens of thousands of linear equations. Iterative linear system solvers, such as Gauss-Seidel, Jacobi, or conjugate descent, are quite demanding for a system of equations this large. An alternate approach, known as progressive refinement, offers some computational tractability and delivers an approximate solution relatively quickly. This dissertation presents the results of partitioning the radiosity computation to suitably map onto a variety of multiprocessor classes. The effect of problem decomposition on computation and communication components is studied for the shared memory, the message passing, and the loosely coupled distributed memory multiprocessors. Kendall Square Research's KSR1 and the Intel hypercube iPSC/860 were used for experimenting with the shared memory and message-passing algorithms, respectively. A network of IBM RS/6000 workstations was used for understanding coarse-grain parallelization techniques. These experiments demonstrated that optimality of parallel algorithms must be considered as a <machine, algorithm> pair. Thus the notion of program portability must also take machine architecture into consideration besides allowing for software compatibility. As the number of polygons for processing complex scenes continues to grow, the subdivision of the object space becomes increasingly important. An adaptive technique for binary subdivision of the object space is outlined and used in all the experiments. The resulting tree has a better balance as compared to the conventional techniques. A multiprocessor architecture that utilizes the object space subdivision and uses the token-driven dataflow computation model is proposed as a hardware solution for radiosity. The proposed architecture is targeted toward high-end workstations, which can benefit from the proposed design in performing radiosity computation and other similar tasks.

  12. Frequency-multiplexed and pipelined iterative optical systolic array processors

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Jackson, J.; Neuman, C.

    1983-01-01

    Optical matrix processors using acoustooptic transducers are described, with emphasis on new systolic array architectures using frequency multiplexing in addition to space and time multiplexing. A Kalman filtering application is considered in a case study from which the operations required on such a system can be defined. This also serves as a new and powerful application for iterative optical processors. The importance of pipelining the data flow and the ordering of the operations performed in a specific application of such a system are also noted. Several examples of how to effectively achieve this are included. A new technique for handling bipolar data on such architectures is also described.

  13. Optimal evaluation of array expressions on massively parallel machines

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

    1992-01-01

    We investigate the problem of evaluating FORTRAN 90 style array expressions on massively parallel distributed-memory machines. On such machines, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression. The choice of where to perform the operation then affects this cost. We present algorithms based on dynamic programming to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider expressions containing operations that change the shape of the arrays, and show that our approach extends naturally to handle this case.
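
    A toy version of the dynamic program, under simplifying assumptions (a ring of processors, array placements described by a single offset, realignment cost = ring distance): each node of the expression tree keeps a table of the cheapest way to deliver its result at every offset:

        RING = 8                                  # processors arranged in a ring

        def shift_cost(a, b):
            d = abs(a - b)
            return min(d, RING - d)               # circular shift distance

        def best_offset(node):
            """node: ('leaf', offset) or ('op', left, right). Returns a table
            offset -> cheapest total realignment cost to put the result there."""
            if node[0] == 'leaf':
                return {k: shift_cost(node[1], k) for k in range(RING)}
            L, R = best_offset(node[1]), best_offset(node[2])
            # perform the elementwise op where the operands are co-located (a),
            # then optionally shift the single result on to offset k
            compute = {a: L[a] + R[a] for a in range(RING)}
            return {k: min(compute[a] + shift_cost(a, k) for a in range(RING))
                    for k in range(RING)}

        # (A + B) * C with A at offset 0, B at offset 5, C at offset 6
        expr = ('op', ('op', ('leaf', 0), ('leaf', 5)), ('leaf', 6))
        table = best_offset(expr)
        k = min(table, key=table.get)
        print("evaluate at offset", k, "total shift cost", table[k])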

  14. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  15. Analog parallel processor hardware for high speed pattern recognition

    NASA Technical Reports Server (NTRS)

    Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

    1990-01-01

    A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 microsec. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.
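
    A digital stand-in for the match-and-select pipeline (sizes and data illustrative): an inner product of the input against every stored pattern, followed by a winner-take-all selection of the largest response:

        import numpy as np

        def classify(library, x):
            scores = library @ x           # memory matrix: one inner product
                                           # per stored pattern
            return int(np.argmax(scores))  # winner-take-all stage

        rng = np.random.default_rng(1)
        library = rng.choice([-1.0, 1.0], size=(8, 64))      # 8 bipolar patterns
        probe = library[5].copy()
        probe[rng.choice(64, size=8, replace=False)] *= -1   # ~12% bits corrupted
        print(classify(library, probe))                      # 5, despite the noise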

  16. Optimal mapping of irregular finite element domains to parallel processors

    NASA Technical Reports Server (NTRS)

    Flower, J.; Otto, S.; Salama, M.

    1987-01-01

    Mapping the solution domain of n finite elements into N subdomains that may be processed in parallel by N processors is optimal if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
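
    A minimal simulated-annealing sketch of the mapping idea, with an illustrative cost that charges for cut edges (communication) and squared processor loads (imbalance); the paper's actual cost model and cooling schedule differ:

        import math, random

        def anneal(edges, n_elems, n_procs, steps=20000, w_comm=1.0, w_load=0.5):
            assign = [random.randrange(n_procs) for _ in range(n_elems)]

            def cost(a):
                cut = sum(1 for u, v in edges if a[u] != a[v])
                loads = [a.count(p) for p in range(n_procs)]
                return w_comm * cut + w_load * sum(l * l for l in loads)

            c = cost(assign)
            for step in range(steps):
                T = 2.0 * (1.0 - step / steps) + 1e-9      # cooling schedule
                e, p = random.randrange(n_elems), random.randrange(n_procs)
                old, assign[e] = assign[e], p              # propose a move
                c_new = cost(assign)
                if c_new <= c or random.random() < math.exp((c - c_new) / T):
                    c = c_new                              # accept
                else:
                    assign[e] = old                        # reject and restore
            return assign

        # a 4x4 grid of elements with 4-neighbor connectivity, on 2 processors
        edges = [(4 * r + c, 4 * r + c + 1) for r in range(4) for c in range(3)]
        edges += [(4 * r + c, 4 * (r + 1) + c) for r in range(3) for c in range(4)]
        a = anneal(edges, 16, 2)
        print([a[4 * i:4 * i + 4] for i in range(4)])      # typically two compact halves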

  17. QPACE -- a QCD parallel computer based on Cell processors

    E-print Network

    H. Baier; H. Boettiger; M. Drochner; N. Eicker; U. Fischer; Z. Fodor; A. Frommer; C. Gomez; G. Goldrian; S. Heybrock; D. Hierl; M. Hüsken; T. Huth; B. Krill; J. Lauritsen; T. Lippert; T. Maurer; B. Mendl; N. Meyer; A. Nobile; I. Ouda; M. Pivanti; D. Pleiter; M. Ries; A. Schäfer; H. Schick; F. Schifano; H. Simma; S. Solbrig; T. Streuer; K. -H. Sulanke; R. Tripiccione; J. -S. Vogt; T. Wettig; F. Winter

    2009-12-24

    QPACE is a novel parallel computer which has been developed to be primarily used for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the Playstation 3. The QPACE nodes are interconnected by a custom, application optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack a new water cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.

  18. VLSI array processor R&D status report

    NASA Astrophysics Data System (ADS)

    Greenwood, E.

    1982-01-01

    Detailed design of the Arithmetic Processor Unit (APU) chip has been completed. All cell types (100) have been run through the design rule check (DRC) programs, corrected and verified. DRC runs on the entire chip have been completed and all corrections have been made. Fifteen of the eighteen chip DRC corrections have been verified. The metal, polysilicon and information data layers of the APU layout are shown. The attached drawing, titled 'VLSI Array Processor Arithmetic Processor Unit Chip Plan,' is a detailed drawing of the APU chip plan. The functional-level simulator of the APU has been built and verified using a set of APU diagnostic code. A gate-level logic simulation of the APU has been built. The APU breadboard modules have been fabricated and checkout has been initiated. The Array Processor Demonstration System (APDS) modules are in the wire-wrap process. The APDS and APU microcode assemblers have been built and checked out. The linker and loader for the APDS have also been built.

  19. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1990-01-01

    Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.
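
    The diagonal scaling mentioned above corresponds to Jacobi preconditioning of conjugate gradients. A compact sketch of that standard algorithm, not the paper's code:

    ```python
    # Jacobi (diagonal) preconditioned conjugate gradient.
    import numpy as np

    def pcg(A, b, tol=1e-8, max_iter=500):
        d_inv = 1.0 / np.diag(A)            # Jacobi preconditioner M^-1
        x = np.zeros_like(b)
        r = b - A @ x
        z = d_inv * r
        p = z.copy()
        rz = r @ z
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol:
                break
            z = d_inv * r               # apply diagonal scaling
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    print(pcg(A, np.array([1.0, 2.0])))
    ```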

  20. Bit-parallel arithmetic in a massively-parallel associative processor

    NASA Technical Reports Server (NTRS)

    Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

    1992-01-01

    A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.

  1. Prototype Focal-Plane-Array Optoelectronic Image Processor

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey

    1995-01-01

    Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.
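
    The morphological operations such a processor implements reduce to neighborhood logic computed at every pixel in parallel. A small NumPy sketch of binary erosion and dilation with a 3x3 cross (illustrative only, not the NPO-18174 design):

    ```python
    # Binary erosion and dilation: each output pixel is the AND / OR of
    # the pixel and its 4-connected neighbors (zero-padded borders).
    import numpy as np

    def shifts(img):
        padded = np.pad(img, 1)
        return [padded[1:-1, 1:-1], padded[:-2, 1:-1], padded[2:, 1:-1],
                padded[1:-1, :-2], padded[1:-1, 2:]]

    def dilate(img):
        return np.logical_or.reduce(shifts(img))

    def erode(img):
        return np.logical_and.reduce(shifts(img))

    img = np.zeros((8, 8), dtype=bool)
    img[3:6, 3:6] = True
    print(erode(img).sum(), dilate(img).sum())  # erosion shrinks, dilation grows
    ```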

  2. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

    Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

  3. On program restructuring, scheduling, and communication for parallel processor systems

    SciTech Connect

    Polychronopoulos, Constantine D.

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler, was used to transform programs into parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.
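
    Of the restructuring techniques named above, loop coalescing is the simplest to show in miniature: a loop nest is flattened into a single index space that is easier to schedule across processors. A toy illustration, not Parafrase output:

    ```python
    # Loop coalescing: a doubly nested loop becomes one loop of N*M
    # iterations whose single index is easy to self-schedule.
    N, M = 6, 4

    # original nest
    out1 = [[i * M + j for j in range(M)] for i in range(N)]

    # coalesced form: recover (i, j) from the flat index k
    out2 = [[0] * M for _ in range(N)]
    for k in range(N * M):
        i, j = divmod(k, M)
        out2[i][j] = i * M + j

    assert out1 == out2
    ```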

  4. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, since my previous use of computers was as a classroom demonstration tool.

  5. Periodic Application of Concurrent Error Detection in Processor Array Architectures. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Chen, Paul Peichuan

    1993-01-01

    Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

  6. Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

    1997-01-01

    A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and a multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected to their neighboring neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

  7. Massively parallel processor networks with optical express channels

    DOEpatents

    Deri, Robert J. (Pleasanton, CA); Brooks, III, Eugene D. (Livermore, CA); Haigh, Ronald E. (Tracy, CA); DeGroot, Anthony J. (Castro Valley, CA)

    1999-01-01

    An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.

  8. Massively parallel processor networks with optical express channels

    DOEpatents

    Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

    1999-08-24

    An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

  9. Design of a dataway processor for a parallel image signal processing system

    NASA Astrophysics Data System (ADS)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

    1995-04-01

    Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates 8 bits in parallel in full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology and comprises about 200 K gates.

  10. Application of the hypercube parallel processor to a large-scale moment method code

    NASA Technical Reports Server (NTRS)

    Manshadi, Farzin; Liewer, Paulet C.; Patterson, Jean E.

    1988-01-01

    The applicability of a parallel computing architecture to the solution of a large-scale moment-method code is investigated. Specifically, the NEC (Numerical Electromagnetics Code) method-of-moments scattering program is implemented on a hypercube parallel processor. The accuracy and the increase in the speed of execution on this parallel architecture are demonstrated. The results show a very large reduction in execution time for large problems. The great potential of this parallel processor is shown for interactive solution of large NEC problems as well as other moment-method techniques such as the finite-element method.

  11. A New Argus Direct Conversion Receiver and Digital Array Receiver/Processor

    E-print Network

    Ellingson, Steven W.

    (Only report front matter is indexed here: the appendices provide source code for the Direct Conversion FPGA, the Digital Receiver/Processor, and the PCI-DIO-32HS Controller.)

  12. Evaluation of fault-tolerant parallel-processor architectures over long space missions

    NASA Technical Reports Server (NTRS)

    Johnson, Sally C.

    1989-01-01

    The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with 0.99 probability and that the probability of system failure during one-half hour of full operation be less than 10^(-7). The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.
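
    To make the reliability target concrete, here is a back-of-envelope sparing estimate of the kind such an evaluation involves; the per-processor failure rate is an illustrative assumption, not a figure from the study:

    ```python
    # P(at least 256 of n processors survive 5 years), assuming
    # independent exponential failures at an assumed 1e-6/hour rate.
    from math import comb, exp

    hours = 24 * 365 * 5
    p = exp(-1e-6 * hours)              # per-processor 5-year survival prob.
    need = 256                          # processors required for operation

    def mission_reliability(n_total):
        return sum(comb(n_total, k) * p**k * (1 - p)**(n_total - k)
                   for k in range(need, n_total + 1))

    for n in (256, 272, 288):           # 0, 16, and 32 spares
        print(n, round(mission_reliability(n), 4))
    ```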

  13. Using algebra for massively parallel processor design and utilization

    NASA Technical Reports Server (NTRS)

    Campbell, Lowell; Fellows, Michael R.

    1990-01-01

    This paper summarizes the authors' advances in the design of dense processor networks. Reported within is a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.
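
    The yardstick for such degree-diameter constructions is the Moore bound, the maximum number of nodes any graph of degree d and diameter k can have:

    ```python
    # Moore bound: 1 + d * ((d-1)^k - 1) / (d-2) for d > 2.
    def moore_bound(d, k):
        if d == 2:
            return 2 * k + 1
        return 1 + d * ((d - 1)**k - 1) // (d - 2)

    for d, k in [(3, 2), (7, 2), (4, 3)]:
        print(d, k, moore_bound(d, k))   # (3,2) -> 10, met by the Petersen graph
    ```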

  14. High-speed Systolic Array Processor (HISSAP) system development synopsis: Lesson learned. Final report, Oct 83-Oct 90

    SciTech Connect

    Loughlin, J.P.

    1991-05-01

    This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic and software mapping tools is described. Based on the experience gained in the mapping of matrix-based algorithms on the testbed hardware, specific recommendations are presented in the form of lessons learned, which are intended to offer guidance in the development of future Navy signal processing systems.

  15. Reduction of solar vector magnetograph data using a microMSP array processor

    NASA Technical Reports Server (NTRS)

    Kineke, Jack

    1990-01-01

    The processing of raw data obtained by the solar vector magnetograph at NASA-Marshall requires extensive arithmetic operations on large arrays of real numbers. The objectives of this summer faculty fellowship study are to: (1) learn the programming language of the MicroMSP Array Processor and adapt some existing data reduction routines to exploit its capabilities; and (2) identify other applications and/or existing programs which lend themselves to array processor utilization which can be developed by undergraduate student programmers under the provisions of project JOVE.

  16. Aligning parallel arrays to reduce communication

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

    1994-01-01

    Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

  17. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  18. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Gibson, Garth Alan

    1990-01-01

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems and, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
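
    Parity, the simplest code examined, already corrects any single self-identifying disk failure: the parity block is the XOR of the data blocks, so the lost block is the XOR of the survivors. A minimal sketch:

    ```python
    # Parity-based recovery as used in RAID (illustrative sizes).
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.integers(0, 256, size=(4, 16), dtype=np.uint8)  # 4 data disks
    parity = np.bitwise_xor.reduce(data, axis=0)               # parity disk

    lost = 2                                     # disk 2 fails (identity known)
    survivors = np.delete(data, lost, axis=0)
    rebuilt = np.bitwise_xor.reduce(survivors, axis=0) ^ parity

    assert np.array_equal(rebuilt, data[lost])
    ```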

  19. Application of a dynamically reconfigurable cell-array processor to an MPEG-2 video decoder

    NASA Astrophysics Data System (ADS)

    Komoku, Kiyotaka; Hatano, Fumihiro; Morishita, Takayuki; Teramoto, Iwao

    2000-10-01

    We have proposed and developed the Dynamically Reconfigurable Cell-Array Processor (DRCAP), which consists of functional Cell Arrays (CAs) and buses/bus-switches that provide connections between CAs. A software simulator of the DRCAP was constructed, on which an MPEG-2 video decoder was successfully implemented. This MPEG-2 decoder dynamically changes its configuration many times during the decoding process. The processing is executed for every macro-block, reconfiguring for each component process of MPEG-2 decoding such as variable length decoding, dequantization, the inverse DCT, and so on. The resources required for the DRCAP to decode the MPEG-2 MP@ML video stream are investigated. In the simulation it is found that the numbers of CAs needed to decode the MPEG-2 MP@ML video stream are 8 PCAs, 1 LCA, 2 CCAs and 35 MCAs, and the required execution frequency is 94.6 MHz. In the case of doubling all configurations, where the same two processes are executed in parallel, the numbers of CAs are 15, 1, 4 and 69 for PCA, LCA, CCA and MCA, respectively, and an execution frequency of 55.9 MHz is required.

  20. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (inventor); Bejczy, Antal K. (inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  1. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, T.

    1986-01-01

    A nonlinear structural dynamics program with an element library that exploits parallel processing is under development. The aim is to exploit scheduling-allocation so that parallel processing and vectorization can effectively be treated in a general-purpose program. As a byproduct, an automatic scheme for assigning time steps was devised. A rudimentary form of the program is complete and has been tested; it shows that substantial advantage can be taken of parallelism. In addition, a stability proof for the subcycling algorithm has been developed.

  2. Parallel ART for image reconstruction in CT using processor arrays

    E-print Network

    Gordon, Dan

    (January 2006) Algebraic Reconstruction Technique (ART) is a widely-used iterative method for solving the image reconstruction problem in CT. It is shown that for this particular problem, ART with a small relaxation parameter produces excellent results.
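
    For context, a single-threaded sketch of the classical ART (Kaczmarz) iteration with a relaxation parameter, using toy data rather than the paper's parallel implementation:

    ```python
    # One ART (Kaczmarz) sweep: each row a_i . x = b_i pulls the
    # iterate toward its hyperplane, scaled by relaxation lam.
    import numpy as np

    def art(A, b, lam=0.25, sweeps=300):
        x = np.zeros(A.shape[1])
        for _ in range(sweeps):
            for a_i, b_i in zip(A, b):
                x += lam * (b_i - a_i @ x) / (a_i @ a_i) * a_i
        return x

    A = np.array([[1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
    x_true = np.array([2.0, 1.0])
    print(art(A, A @ x_true))           # approaches [2.0, 1.0]
    ```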

  3. Scheduling Two Classes of Exponential Jobs on Parallel Processors: Structural Results and Worst Case Analysis

    E-print Network

    Chang, Cheng-Shang

    We consider the scheduling of two classes of jobs on parallel processors. We assume that all jobs are present at time 0 and there are no further arrivals. The service times of class 1 (2) jobs are independent and exponentially distributed with mean 1/μ1 (1/μ2).

  4. Binocular Disparity Calculation on a Massively-Parallel Analog Vision Processor

    E-print Network

    Dudek, Piotr

    We studied neuromorphic models of binocular disparity processing and mapped them onto a vision chip containing ... two horizontally-separated virtual cameras, thereby allowing us to run our binocular disparity models.

  5. Parallel Data Mining for Association Rules on Shared-memory Multi-processors

    E-print Network

    Zaki, Mohammed Javeed

    Many parallel algorithms have been proposed for data mining of association rules. However, research so far has mainly ... In this paper we concentrate on data mining for association rules. Application domains for association rules include ...

  6. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1989-01-01

    A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.

  7. Series-parallel method of direct solar array regulation

    NASA Technical Reports Server (NTRS)

    Gooder, S. T.

    1976-01-01

    A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within + or - 0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.

  8. Impact of shipping Ball-Grid-Array Notebook processors in tape and reel on the PC supply chain

    E-print Network

    Chuang, Pamela

    2012-01-01

    Today, approximately 90% of Intel notebook processors are packaged in PGA (Pin Grid Array) and 10% are packaged in BGA (Ball Grid Array). Intel has recently made a decision to transform the notebook industry by creating a ...

  9. Evaluation of a simplified version of KENO V. a on a parallel processors computer

    SciTech Connect

    Ugolini, D.; Petrie, L.M.; Dodds, H.L. Jr.

    1987-01-01

    KENO V.a is a widely used Monte Carlo criticality code developed by Oak Ridge National Laboratory for use primarily on large single-processor mainframe computers. The code can be very costly to use if a large number of histories is required, because the histories are performed sequentially on the single processor. With the advent of parallel processor computers, it should be possible to reduce computing costs (i.e., computer run time) by performing the histories in parallel. The purpose of this work is to implement KENO V.a on a parallel processor computer, specifically the NCUBE, and then to compare results obtained on the NCUBE (i.e., accuracy and computing time) with results obtained on a large mainframe computer (IBM 3033). The NCUBE is a message-passing machine with no shared memory. A simplified version of KENO V.a was developed for this study because the standard version was too large to compile on the NCUBE. In addition, a special 1-group cross-section library, reduced from the standard 16-group Hansen-Roach library, was also used. The sample problem used in this study was an 18-cm-diam sphere of 235U at 0.05 atom/(b·cm).
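
    The parallelization idea in miniature: Monte Carlo histories are independent, so they can be split across workers with separate random-number streams and the tallies combined at the end. A toy Python sketch estimating pi rather than k-eff:

    ```python
    # Independent Monte Carlo histories split across worker processes;
    # each worker gets its own seed, and the tallies are summed.
    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def run_histories(args):
        seed, n = args
        rng = np.random.default_rng(seed)
        xy = rng.random((n, 2))
        return int(((xy**2).sum(axis=1) <= 1.0).sum())  # hits inside circle

    if __name__ == "__main__":
        n_workers, n_per = 4, 250_000
        with ProcessPoolExecutor(n_workers) as ex:
            hits = sum(ex.map(run_histories,
                              [(s, n_per) for s in range(n_workers)]))
        print(4.0 * hits / (n_workers * n_per))   # ~3.1416
    ```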

  10. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

    1989-01-01

    The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the Connection Machine, is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the elements with an exchange of nodal forces at each time step. The architectural and C* programming language features of the Connection Machine are also summarized. Various alternative data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the Connection Machine is capable of outperforming the CRAY X-MP/14.

  11. Parallel collective resonances in arrays of gold nanorods.

    PubMed

    Vitrey, Alan; Aigouy, Lionel; Prieto, Patricia; García-Martín, José Miguel; González, María U

    2014-01-01

    In this work we discuss the excitation of parallel collective resonances in arrays of gold nanoparticles. Parallel collective resonances result from the coupling of the nanoparticles localized surface plasmons with diffraction orders traveling in the direction parallel to the polarization vector. While they provide field enhancement and delocalization as the standard collective resonances, our results suggest that parallel resonances could exhibit greater tolerance to index asymmetry in the environment surrounding the arrays. The near- and far-field properties of these resonances are analyzed, both experimentally and numerically. PMID:24645987

  12. Implementation of context independent code on a new array processor: The Super-65

    NASA Technical Reports Server (NTRS)

    Colbert, R. O.; Bowhill, S. A.

    1981-01-01

    The feasibility of rewriting standard uniprocessor programs into code which contains no context-dependent branches is explored. Context independent code (CIC) would contain no branches that might require different processing elements to branch different ways. In order to investigate the possibilities and restrictions of CIC, several programs were recoded into CIC and a four-element array processor was built. This processor (the Super-65) consisted of three 6502 microprocessors and the Apple II microcomputer. The results obtained were somewhat dependent upon the specific architecture of the Super-65 but within bounds, the throughput of the array processor was found to increase linearly with the number of processing elements (PEs). The slope of throughput versus PEs is highly dependent on the program and varied from 0.33 to 1.00 for the sample programs.
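
    The flavor of context-independent code: every processing element evaluates both arms of a conditional and selects the result with a mask, so the instruction stream never diverges. A NumPy illustration (an assumed example, not one of the recoded programs):

    ```python
    # Branchless select: all PEs do identical work regardless of data.
    import numpy as np

    x = np.array([-3.0, 1.0, -0.5, 2.0])   # one value per processing element

    # branching version (per element): y = x*x if x > 0 else -x
    # context-independent version: evaluate both arms, select by mask
    mask = (x > 0.0)
    y = np.where(mask, x * x, -x)
    print(y)    # [3.  1.  0.5 4. ]
    ```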

  13. Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition

    DOEpatents

    Tomkins, James L. (Albuquerque, NM); Camp, William J. (Albuquerque, NM)

    2007-07-17

    A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

  14. A unified approach to VLSI layout automation and algorithm mapping on processor arrays

    NASA Technical Reports Server (NTRS)

    Venkateswaran, N.; Pattabiraman, S.; Srinivasan, Vinoo N.

    1993-01-01

    Development of software tools for designing supercomputing systems is highly complex and cost-ineffective. To tackle this, a special-purpose PAcube silicon compiler, which integrates different design levels from cells to processor arrays, has been proposed. As a part of this, we present in this paper a novel methodology which unifies the problems of Layout Automation and Algorithm Mapping.

  15. Construction of a parallel processor for simulating manipulators and other mechanical systems

    NASA Technical Reports Server (NTRS)

    Hannauer, George

    1991-01-01

    This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

  16. Transmissive Nanohole Arrays for Massively-Parallel Optical Biosensing

    E-print Network

    Bao, Jiming

    The technique combines optical transmission of nanoholes with colorimetric silver staining. The size and spacing of the nanoholes are chosen so that individual nanoholes can be independently resolved in massively parallel fashion using ...

  17. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    SciTech Connect

    Cancelo, G.; Gottschalk, Erik Edward; Pavlicek, V.; Wang, M.; Wu, J.

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  18. An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

    SciTech Connect

    Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.; Catalyurek, Umit V.; Kurc, Tahsin; Sadayappan, Ponnuswamy; Saltz, Joel H.

    2009-08-01

    Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the inter-task data communication volumes. A locality-conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have a lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optimum than other scheduling approaches.

  19. Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2011-04-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

  20. Application of an array processor to the analysis of magnetic data for the Doublet III tokamak

    SciTech Connect

    Wang, T.S.; Saito, M.T.

    1980-08-01

    Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed.

  1. Fast structural design and analysis via hybrid domain decomposition on massively parallel processors

    NASA Technical Reports Server (NTRS)

    Farhat, Charbel

    1993-01-01

    A hybrid domain decomposition framework for static, transient and eigen finite element analyses of structural mechanics problems is presented. Its basic ingredients include physical substructuring and/or automatic mesh partitioning, mapping algorithms, 'gluing' approximations for fast design modifications and evaluations, and fast direct and preconditioned iterative solvers for local and interface subproblems. The overall methodology is illustrated with the structural design of a solar viewing payload that is scheduled to fly in March 1993. This payload has been entirely designed and validated by a group of undergraduate students at the University of Colorado using the proposed hybrid domain decomposition approach on a massively parallel processor. Performance results are reported on the CRAY Y-MP/8 and the iPSC-860/64 Touchstone systems, which represent two extremes of parallel architecture. The hybrid domain decomposition methodology is shown to outperform leading solution algorithms and to exhibit excellent parallel scalability.

  2. Parallel arrays of Josephson junctions for submillimeter local oscillators

    NASA Technical Reports Server (NTRS)

    Pance, Aleksandar; Wengler, Michael J.

    1992-01-01

    In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

  3. Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

    SciTech Connect

    Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

    2010-01-01

    An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering more than 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

  4. An Investigation into Reliability, Availability, and Serviceability (RAS) Features for Massively Parallel Processor Systems

    SciTech Connect

    KELLY, SUZANNE M.; OGDEN, JEFFREY BRANDON

    2002-10-01

    A study has been completed into the RAS features necessary for Massively Parallel Processor (MPP) systems. As part of this research, a use case model was built of how RAS features would be employed in an operational MPP system. Use cases are an effective way to specify requirements so that all involved parties can easily understand them. This technique is in contrast to laundry lists of requirements that are subject to misunderstanding as they are without context. As documented in the use case model, the study included a look at incorporating system software and end-user applications, as well as hardware, into the RAS system.

  5. Block iterative restoration of astronomical images with the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Heap, Sara R.; Lindler, Don J.

    1987-01-01

    A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 linear systems of size 121 x 121. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
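
    A sketch of the block-iterative idea: solve small diagonal blocks directly and iterate on the coupling. The diagonally dominant test system below is a stand-in, at toy scale, for the restoration operator:

    ```python
    # Block-Jacobi iteration: direct solves on diagonal blocks,
    # off-block coupling handled by iteration.
    import numpy as np

    rng = np.random.default_rng(0)
    n, bs = 120, 12                      # 120 unknowns in blocks of 12
    A = 0.01 * rng.random((n, n)) + np.eye(n)   # diagonally dominant
    x_true = rng.random(n)
    b = A @ x_true

    x = np.zeros(n)
    for _ in range(50):
        x_new = np.empty(n)
        for s in range(0, n, bs):
            blk = slice(s, s + bs)
            # right-hand side with the off-block coupling moved over
            r = b[blk] - A[blk] @ x + A[blk, blk] @ x[blk]
            x_new[blk] = np.linalg.solve(A[blk, blk], r)
        x = x_new
    print(np.max(np.abs(x - x_true)))    # small error after 50 sweeps
    ```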

  6. Estimating water flow through a hillslope using the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Devaney, Judy E.; Camillo, P. J.; Gurney, R. J.

    1988-01-01

    A new two-dimensional model of water flow in a hillslope has been implemented on the Massively Parallel Processor at the Goddard Space Flight Center. Flow in the soil both in the saturated and unsaturated zones, evaporation and overland flow are all modelled, and the rainfall rates are allowed to vary spatially. Previous models of this type had always been very limited computationally. This model takes less than a minute to model all the components of the hillslope water flow for a day. The model can now be used in sensitivity studies to specify which measurements should be taken and how accurate they should be to describe such flows for environmental studies.

  7. Animated computer graphics models of space and earth sciences data generated via the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

    1987-01-01

    A capability was developed for rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets by implementing computer graphics modeling techniques on the Massively Parallel Processor (MPP), employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

  8. Basic data-base operations on the Butterfly Parallel Processor: experiment results. Memorandum report, January-December 1987

    SciTech Connect

    Rosenau, T.J.; Jajodia, S.

    1988-03-04

    The next phase in speeding up data-base queries will be through the use of highly parallel computers. This paper will discuss the basic data-base operations (select, project, natural join, and scalar aggregates) on a shared-memory multiple instruction stream, multiple data stream (MIMD) computer and the problems associated with implementing them. Some problems associated with getting maximum parallelization are improper data division and hot spots. Improper data division results when the number of tasks does not divide evenly among the processors. Hot spots or contentions occur due to locking if accesses are made to the same segment of a RAMFile and also if attempts are made to get data from the same remote processor at the same time. These algorithms have been implemented on the Butterfly Parallel Processor, and the results of our experiments are described in detail.

  9. Mobile and replicated alignment of arrays in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  10. Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Dimpsey, Robert Tod

    1992-01-01

    In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time-varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, there have been few evaluation efforts addressing this issue.

  11. Parallel computation of optimized arrays for 2-D electrical imaging surveys

    NASA Astrophysics Data System (ADS)

    Loke, M. H.; Wilkinson, P. B.; Chambers, J. E.

    2010-12-01

    Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman-Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix-vector multiplications for a single add-on array were replaced with matrix-matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.
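
    The rank-1 update at the heart of the speedup is the Sherman-Morrison identity, which refreshes an inverse in O(n^2) instead of refactoring in O(n^3). A generic illustration, not the survey code:

    ```python
    # Sherman-Morrison: (B + u v^T)^-1 = B^-1 - (B^-1 u v^T B^-1) / (1 + v^T B^-1 u)
    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.random((5, 5)) + 5 * np.eye(5)
    B_inv = np.linalg.inv(B)
    u, v = rng.random(5), rng.random(5)

    Bu = B_inv @ u
    update = np.outer(Bu, v @ B_inv) / (1.0 + v @ Bu)
    B_new_inv = B_inv - update           # O(n^2) rank-1 update of the inverse

    print(np.allclose(B_new_inv, np.linalg.inv(B + np.outer(u, v))))  # True
    ```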

  12. Parallel optical interconnects with mixed-signal OEIC and fibre arrays for high-speed communication

    NASA Astrophysics Data System (ADS)

    Fey, Dietmar; Hoppe, Lutz; Loos, Andreas; Fortsch, Michael; Zimmermann, Horst

    2004-09-01

    We present a system for direct parallel optical data communication between integrated circuits on neighbouring printed circuit boards based on a monolithic integrated CMOS smart pixel array, fibre arrays, and VCSELs. The advantage of our system versus backplane systems is the direct data transfer through free space, avoiding planar and area-consuming interconnections. The detector chip allows a data rate of 625 Mbit/s per link and is cycled by an optical clock. A simulation of the chip layout showed 260% higher performance than electrical off-chip interconnects. In principle an 8x8 data transfer is feasible, allowing a data rate of 40 Gbit/s. The detector combines an optical receiver array with a digital processor array which executes image processing algorithms. The optical receiver is formed by a PIN photodiode with a diameter of 40 μm, a transimpedance amplifier (TIA) and a decision-making postamplifier. The measured responsivity of the photodiode without antireflection coating is R=0.382 A/W at an optical wavelength of 670 nm. The TIA consists of a CMOS inverter and a PMOS transistor forming the feedback resistor. Together with the postamplifier, formed by a chain of five CMOS inverters and attaining digital CMOS levels, a data rate of 625 Mbit/s is achieved.

  13. Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications

    NASA Technical Reports Server (NTRS)

    Bruno, John

    1984-01-01

    The results of an investigation into the feasibility of using the Massively Parallel Processor (MPP) for direct and large eddy simulations of the Navier-Stokes equations are presented. A major part of this study was devoted to the implementation of two standard numerical algorithms for CFD. These implementations were not run on the MPP, since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed, and from it were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered, namely adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This, combined with the prospect of building a faster and larger MPP-like machine, holds the promise of achieving the sustained gigaflop rates required for numerical simulations in CFD.

  14. NOSC (Naval Ocean Systems Center) advanced systolic array processor (ASAP). Professional paper for period ending August 1987

    SciTech Connect

    Loughlin, J.P.

    1987-12-01

    Design of a high-speed (250 million 32-bit floating-point operations per second) two-dimensional systolic array composed of 16-bit/slice microsequencer structured processors is presented. System-design features such as broadcast data flow, tag bit movement, and integrated diagnostic test registers are described. The software development tools needed to map complex matrix-based signal-processing algorithms onto the systolic-processor system are described.

  15. Evaluation of the Intel iWarp parallel processor for space flight applications

    NASA Technical Reports Server (NTRS)

    Hine, Butler P., III; Fong, Terrence W.

    1993-01-01

    The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future Space Station Freedom (SSF) Data Management System (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited to high-performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.

  16. On-board landmark navigation and attitude reference parallel processor system

    NASA Technical Reports Server (NTRS)

    Gilbert, L. E.; Mahajan, D. T.

    1978-01-01

    An approach to autonomous navigation and attitude reference for earth-observing spacecraft is described, along with a landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine whether better-than-one-pixel registration accuracy can be achieved, consistent with onboard processor timing and capacity constraints, are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and a chip library. The data are processed in parallel stages, effectively reducing the time to match the small known image within the larger image seen by the onboard imaging system. Shared memory is incorporated in the system to help communicate intermediate results among the microprocessors. The functions include computing mean values and sums of absolute differences over the image search area. The hardware is a low-power, compact unit suitable for onboard application, with the flexibility to accommodate different parameters depending upon the environment.
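
    To make the matching step concrete, here is a minimal numpy sketch of an exhaustive sum-of-absolute-differences search of the kind the functions above support; the early-abandon thresholding that makes the algorithm "sequential", and the multi-microprocessor partitioning of the search area, are omitted, and all names and sizes are illustrative.

        import numpy as np

        def sad_match(search, template):
            """Return the (row, col) offset where the template best matches
            the search image under the sum of absolute differences."""
            H, W = search.shape
            h, w = template.shape
            best, best_rc = np.inf, (0, 0)
            for r in range(H - h + 1):          # each offset is independent,
                for c in range(W - w + 1):      # hence easy to parallelize
                    sad = np.abs(search[r:r+h, c:c+w] - template).sum()
                    if sad < best:
                        best, best_rc = sad, (r, c)
            return best_rc

        rng = np.random.default_rng(1)
        scene = rng.integers(0, 256, (64, 64)).astype(float)
        patch = scene[20:28, 30:38].copy()
        assert sad_match(scene, patch) == (20, 30)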

  17. High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

    DOEpatents

    Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

    2002-01-01

    As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low-latency, high-bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS. Performance has been shown to scale to approximately 100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

  18. Parallel Syntheses of Peptides on Teflon-Patterned Paper Arrays (SyntArrays).

    PubMed

    Deiss, Frédérique; Yang, Yang; Derda, Ratmir

    2016-01-01

    Screening of peptides to find the ligands that bind to specific targets is an important step in drug discovery. These high-throughput screens require a large number of structural variants of peptides to be synthesized and tested. This chapter describes the generation of arrays of peptides on Teflon-patterned sheets of paper. First, the protocol describes the patterning of paper with a Teflon solution to produce arrays with solvophobic barriers that are able to confine organic solvents. Next, we describe the parallel syntheses of 96 peptides on Teflon-patterned arrays using the SPOT synthesis method. PMID:26614081

  19. Microchannel cross load array with dense parallel input

    DOEpatents

    Swierkowski, Stefan P.

    2004-04-06

    An architecture or layout is disclosed for microchannel arrays using T or cross (+) loading for electrophoresis, or for other injection and separation chemistries performed in microfluidic configurations. This architecture enables a very dense layout of arrays of functionally identical shaped channels, and it also solves the problem of simultaneously enabling efficient parallel shapes and biasing of the input wells, waste wells, and bias wells at the input end of the separation columns. One T-load architecture uses circular holes with common rows, but not columns, which allows the flow paths for each channel to be identical in shape through the use of multiple mirror-image pieces. Another T-load architecture enables the access hole array to be formed on a biaxial, collinear grid suitable for EDM micromachining (square holes), with common rows and columns.

  20. Parallel vacuum arc discharge with microhollow array dielectric and anode

    SciTech Connect

    Feng, Jinghua; Zhou, Lin; Fu, Yuecheng; Zhang, Jianhua; Xu, Rongkun; Chen, Faxin; Li, Linbo; Meng, Shijian

    2014-07-15

    An electrode configuration with a microhollow array dielectric and anode was developed to obtain parallel vacuum arc discharge. Compared with conventional electrodes, more than 10 parallel microhollow discharges were ignited in the new configuration, which increased the discharge area significantly and caused the cathode to erode more uniformly. The number of vacuum discharge channels could be increased effectively by decreasing the distances between holes or increasing the arc current. Experimental results revealed that plasmas ejected from adjacent hollows and the relatively high arc voltage were two key factors leading to the parallel discharge. The characteristics of the plasmas in the microhollows were investigated as well. The spectral line intensity and electron density of the plasma in a microhollow increased markedly with decreasing microhollow diameter.

  1. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGES Beta

    O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; Anderson, Steve; Woodward, Paul; Dietz, Hank

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish for MPPs what a vectorizable style has accomplished for vector machines, by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code.

  2. Implementation of VQ algorithms on a reconfigurable array processor. Professional paper

    SciTech Connect

    Henderson, T.B.; Thyagarajan, K.S.

    1991-05-01

    Vector quantization is widely used in image data compression applications because it is capable of achieving fractional bit rates with reasonable complexity while its decoding is a very simple table look-up scheme. In image encoding, a vector quantizer accepts a block of pixels and outputs the address of the best-matching tile stored in a codebook. The matching algorithm requires a large number of basic arithmetic operations in typical applications. Since real-time coding is required in many video applications, the need for dedicated processing architectures arises naturally. This paper investigates the mapping of VQ algorithms onto an array processor to achieve near real-time compression of video images.
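
    The best-match step reduces to a nearest-neighbour search over the codebook, which an array processor evaluates as one dense compare across all codewords. A numpy sketch under illustrative sizes (a 256-entry codebook of flattened 4x4 blocks), not the paper's processor mapping:

        import numpy as np

        def vq_encode(blocks, codebook):
            """Map each flattened image block to the index (address) of the
            nearest codebook vector under squared Euclidean distance."""
            d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            return np.argmin(d2, axis=1)

        rng = np.random.default_rng(2)
        codebook = rng.standard_normal((256, 16))   # 256 codewords for 4x4 blocks
        blocks = rng.standard_normal((1000, 16))    # flattened 4x4 pixel blocks
        addresses = vq_encode(blocks, codebook)     # one 8-bit address per block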

  3. Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor

    NASA Technical Reports Server (NTRS)

    Field, E. I.

    1975-01-01

    The ILLIAC IV, a fourth-generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation makes the ILLIAC well suited to performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics of the ILLIAC and of the ARPANET, a telecommunications network which spans the continent and makes the ILLIAC accessible from nearly all major industrial centers in the United States, are summarized. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third-generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications for performing that task are presented, with corresponding manpower estimates and schedules.

  4. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    NASA Technical Reports Server (NTRS)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  5. Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems

    NASA Technical Reports Server (NTRS)

    Das, T. P.

    1988-01-01

    The usefulness of the Massively Parallel Processor (MPP) for the investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedures being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of the investigations and results are reported, including the MPP adaptation of the self-consistent-charge extended Hückel (SCCEH) procedure, the MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and the MPP adaptation of the many-body procedure for atomic systems.

  6. A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array.

    PubMed

    Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

    2013-05-01

    A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer, on which a pulse sequence is executed as a subroutine. Field-programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and the FPGA then carries out the event function automatically according to preset configurations, in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real time by an FPGA; the rf source is constructed using the direct digital synthesis technique, and the rf receiver is constructed using the digital quadrature detection technique. Good performance is achieved, including 1 µs time resolution of the gradient waveform, 1 µs time resolution of the soft pulse, and 2 MHz signal-receiving bandwidth. Both rf synthesis and rf digitization operate from the same 60 MHz clock; therefore, the frequency range for transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurements in the nuclear magnetic resonance field. PMID:23742570
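
    The receiver's digital quadrature detection has a standard structure: mix the sampled rf signal against a numerically generated cosine/sine local oscillator, then low-pass filter to obtain baseband I/Q. A numpy sketch with illustrative filter length and frequencies (the 60 MHz clock is from the abstract; nothing here is the spectrometer's firmware):

        import numpy as np

        def quadrature_detect(x, f_if, fs, ntap=101):
            """Mix a real sampled signal down to complex baseband I/Q and
            low-pass filter with a windowed-sinc FIR."""
            n = np.arange(len(x))
            lo_i = np.cos(2 * np.pi * f_if * n / fs)
            lo_q = -np.sin(2 * np.pi * f_if * n / fs)
            t = np.arange(ntap) - (ntap - 1) / 2
            h = np.sinc(2 * (f_if / 4) / fs * t) * np.hamming(ntap)
            h /= h.sum()                        # unity DC gain
            i = np.convolve(x * lo_i, h, mode="same")
            q = np.convolve(x * lo_q, h, mode="same")
            return i + 1j * q

        fs, f_if = 60e6, 1.0e6                  # 60 MHz sampling, 1 MHz IF
        n = np.arange(6000)
        x = np.cos(2 * np.pi * (f_if + 2e3) * n / fs)   # tone 2 kHz off the IF
        iq = quadrature_detect(x, f_if, fs)     # baseband phasor rotating at ~2 kHz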

  7. APE project: a Gigaflop processor for lattice calculations

    SciTech Connect

    Bacilieri, P.; Cabasino, S.; Marzano, F.; Paolucci, P.; Petrarca, S.; Salina, G.; Cabibbo, N.; Giovannella, C.; Marinari, E.; Parisi, G.

    1985-07-01

    A new special-purpose parallel processor (APE), presently under development, is presented. The theoretical computing power of the processor is 1 Gigaflop and the memory can be expanded to 512 Megabytes. Sixteen 52-bit floating-point processors, each with a computing power of 64 Megaflops, are driven in parallel as a single-instruction multiple-data machine under the control of a 3081/E. Each floating-point unit is connected to two 8-Megabyte memories which can also be accessed by the 3081/E. Though this machine can be used as a general-purpose array processor, the hardware has been optimized for lattice QCD calculations.

  8. Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986

    SciTech Connect

    Golub, G.H.

    1987-04-30

    The basic objective of this project was to consider a large class of matrix computations, with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications, such as the solution of partial differential equations by multigrid methods and the fitting of geodetic data. Some of the methods developed have already found use on some of the newly developed architectures.

  9. Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor

    NASA Technical Reports Server (NTRS)

    Moore, J. Strother

    1992-01-01

    Consider a network of four processors that use the Oral Messages (Byzantine Generals) algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease et al. work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module, and we state a theorem expressing the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language, which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.

  10. A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor

    NASA Astrophysics Data System (ADS)

    Sotiropoulou, C. L.; Gkaitatzis, S.; Annovi, A.; Beretta, M.; Kordas, K.; Nikolaidis, S.; Petridou, C.; Volpi, G.

    2014-10-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors via the inner ATLAS read-out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing, and the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors, the clustering is implemented with a 2D-clustering algorithm that takes advantage of a moving-window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted to optimize the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores that identify different clusters independently, thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation ensures stable data processing by inserting the missing control words into the data stream. The 2D pixel clustering implementation has been developed and tested in both single-flow and parallel versions. The first parallel version, with 16 parallel cluster identification engines, is presented. The input data from the RODs are received through S-Links, and the processing units that follow the clustering implementation also require a single data stream; therefore, data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced to accommodate the parallelized version and to restore the data stream afterwards. The results of the first hardware tests of the single-flow implementation on the custom FTK input mezzanine (IM) board are presented. We report on the integration of 16 parallel engines in the same FPGA and the resulting performance. The parallel 2D-clustering implementation has sufficient processing power to meet the specification for the Pixel layers of ATLAS for up to 80 overlapping pp collisions, which corresponds to the maximum LHC luminosity planned until 2022.
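
    For reference, the effect of pixel clustering can be sketched in a few lines of Python: group 8-connected hit pixels and report each cluster's centroid. This is a plain connected-components pass, not the FPGA's moving-window datapath, and the hit coordinates are invented for illustration.

        import numpy as np
        from collections import deque

        def cluster_hits(hits):
            """Group 8-connected hit pixels and return cluster centroids."""
            hitset = set(map(tuple, hits))
            seen, centroids = set(), []
            for seed in hitset:
                if seed in seen:
                    continue
                queue, members = deque([seed]), []
                seen.add(seed)
                while queue:
                    r, c = queue.popleft()
                    members.append((r, c))
                    for dr in (-1, 0, 1):        # scan the 3x3 neighbourhood
                        for dc in (-1, 0, 1):
                            nb = (r + dr, c + dc)
                            if nb in hitset and nb not in seen:
                                seen.add(nb)
                                queue.append(nb)
                centroids.append(np.mean(members, axis=0))
            return centroids

        hits = np.array([[3, 3], [3, 4], [4, 4], [10, 2]])   # two clusters
        print(cluster_hits(hits))   # centroids near (3.33, 3.67) and (10.0, 2.0)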

  11. Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank

    NASA Astrophysics Data System (ADS)

    Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

    2014-05-01

    Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging, and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes, including the well-known true time-delay and phased-array beamformers, have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency-independent RF beams at an order of magnitude lower multiplier complexity compared to FFT- or FIR-filter-based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize the high-precision recursive filter structures necessary for real-time beamforming at RF radio bandwidths are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There is native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold bandwidth (B = N Fclk/2) compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. This increase in bandwidth is achieved without the use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same clock frequency Fclk without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.

  12. Design and numerical evaluation of a volume coil array for parallel MR imaging at ultrahigh fields

    PubMed Central

    Pang, Yong; Wong, Ernest W.H.; Yu, Baiying

    2014-01-01

    In this work, we propose and investigate a volume coil array design method using different types of birdcage coils for MR imaging. Unlike conventional radiofrequency (RF) coil arrays, whose array elements are surface coils, the proposed volume coil array consists of a set of independent volume coils: a conventional birdcage coil, a transverse birdcage coil, and a helix birdcage coil. The magnetic fluxes of these three birdcage coils are intrinsically cancelled, yielding a highly decoupled volume coil array. In contrast to conventional non-array volume coils, the volume coil array is beneficial in improving the MR signal-to-noise ratio (SNR) and also gains the capability of parallel imaging. The volume coil array is evaluated at the ultrahigh field of 7T using FDTD numerical simulations, and the g-factor map at different acceleration rates is also calculated to investigate its parallel imaging performance. PMID:24649435

  13. Scalable Unix commands for parallel processors : a high-performance implementation.

    SciTech Connect

    Ong, E.; Lusk, E.; Gropp, W.

    2001-06-22

    We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
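
    The flavour of these commands can be sketched with mpi4py: every rank examines its local filesystem and rank 0 merges the result, which is roughly what a parallel "ls" must do. This is an illustration of the concept, not the authors' implementation.

        # Run with: mpiexec -n 4 python pls.py /tmp
        import os
        import sys
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        path = sys.argv[1] if len(sys.argv) > 1 else "."

        # Each rank lists the directory as seen from its own node.
        local = sorted(os.listdir(path)) if os.path.isdir(path) else []
        listings = comm.gather((MPI.Get_processor_name(), local), root=0)

        if comm.Get_rank() == 0:
            for host, names in listings:
                print(f"{host}: {' '.join(names)}")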

  14. Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2012-01-01

    The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened, static random-access memory (SRAM)-based field-programmable gate array. The evaluation looks at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included in the report, along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included in this document.

  15. Multimode power processor

    DOEpatents

    O'Sullivan, G.A.; O'Sullivan, J.A.

    1999-07-27

    In one embodiment, a power processor operates in three modes: an inverter mode, wherein power is delivered from a battery to an AC power grid or load; a battery-charger mode, wherein the battery is charged by a generator; and a parallel mode, wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis, wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

  16. Multimode power processor

    DOEpatents

    O'Sullivan, George A. (Pottersville, NJ); O'Sullivan, Joseph A. (St. Louis, MO)

    1999-01-01

    In one embodiment, a power processor operates in three modes: an inverter mode, wherein power is delivered from a battery to an AC power grid or load; a battery-charger mode, wherein the battery is charged by a generator; and a parallel mode, wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis, wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

  17. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described, with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure are described, along with the realization of matrix-vector, matrix-matrix, and triple-matrix products on such architectures. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed, with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

  18. Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

    SciTech Connect

    Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

    2005-09-20

    At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over µClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

  19. Research of control system stability in solar array simulator with continuous power amplifier of parallel type

    NASA Astrophysics Data System (ADS)

    Mizrah, E. A.; Tkachev, S. B.; Shtabel, N. V.

    2015-10-01

    Solar array simulators are nonlinear control systems designed to reproduce the static and dynamic characteristics of a solar array, which depend on illumination, temperature, the space environment, and other factors. During ground testing of spacecraft power systems, it is difficult to achieve stable simulator operation with loads of differing impedance over a wide load-regulation range. In this article the authors propose a method for investigating absolute process stability in solar array simulators and present the results of an absolute stability study for a solar array simulator with a continuous parallel-type power amplifier.

  20. High-performance computational chemistry : hartree-fock electronic structure calculations on massively parallel processors.

    SciTech Connect

    Tilson, J. L.; Minkoff, M.; Wagner, A. F.; Shepard, R.; Sutton, P.; Harrison, R. J.; Kendall, R. A.; Wong, A. T.; PNNL

    1999-01-01

    The parallel performance of the NWChem version 1.2α parallel direct-SCF code has been characterized on five massively parallel supercomputers (IBM SP, Kendall Square KSR-2, CRAY T3D and T3E, and Intel Touchstone DELTA) using single-point energy calculations on seven molecules of varying size (up to 389 atoms) and composition (first-row atoms, halogens, and transition metals). The authors compare the performance using both replicated-data and distributed-data algorithms and the original McMurchie-Davidson and recently incorporated TEXAS integrals packages.

  1. The Panda Array I/O library on the Galley Parallel File System

    E-print Network

    The Panda Array I/O library on the Galley Parallel File System. Joel T. Thomas, Dartmouth Computer ..., Joel.T.Thomas@dartmouth.edu, June 5, 1996. Abstract: The Panda Array I/O library, created some time ..., and the Panda project is an attempt to ameliorate this problem while still providing ...

  2. General-purpose 128 × 128 SIMD processor array with integrated image sensor

    E-print Network

    Dudek, Piotr

    ... processing elements (APEs). While these processors are implemented using analogue circuitry, performing arithmetic and logic operations on data stored in the local memory ... Each APE includes nine ... of data. APEs can communicate and exchange data with their four nearest neighbours. All APEs execute identical ...

  3. Architecture of a VLSI cellular processor array for synchronous/asynchronous image processing

    E-print Network

    Dudek, Piotr

    ... be represented by a wave-propagation process in which pixel operations are triggered by a change of state. Owing to the regularity of image data, the pixel-per-processor approach is particularly efficient in low-level image processing, where the result pixel value is a function of the neighbouring pixels and the current pixel value. ...

  4. A Comparison of Linear Processor Arrays for Image Processing Matthijs van der Molen

    E-print Network

    van Vliet, Lucas J.

    ... /VME) and through interrupts. The cards consist of three principal components, including the Control Processor (CP) ... In the IMAP-VISION, this bus is connected to Video Digital-to-Analog (DAC) and Analog-to-Digital (ADC) ... bus version. The card comprises the aforementioned video I/O interface. The card is controlled ...

  5. Parallel parsing on a one-way array of finite-state machines

    SciTech Connect

    Chang, J.H.; Ibarra, O.H.; Palis, M.A.

    1987-01-01

    The authors show that a one-way two-dimensional iterative array of finite-state machines (2-DIA) can recognize and parse strings of any context-free language in linear time. What makes this result interesting and rather surprising is the fact that each processor of the array holds only a fixed amount of information (independent of the size of the input) and communicates with its neighbors in only one direction. This makes for a simple VLSI implementation. Although it is known that recognition can be done on a 2-DIA, previous parsing algorithms require the processors to have unbounded memory, even when the communication is two-way. They also consider the problem of finding approximate patterns in strings, the string-to-string correction problem, and the longest common subsequence problem, and show that they can be solved in linear time on a 2-DIA.
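
    One of the listed problems, the longest common subsequence, illustrates why such arrays help: in the classic dynamic program below, every cell on an anti-diagonal is independent, so an array of processors can sweep the table diagonal by diagonal in linear time. The Python version is the sequential baseline, not the 2-DIA algorithm.

        def lcs_length(a, b):
            """Classic O(len(a) * len(b)) longest-common-subsequence DP."""
            m, n = len(a), len(b)
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            return dp[m][n]

        assert lcs_length("parallel", "parsing") == 3   # "par"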

  6. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAP) have been in day-to-day use for ten years, and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that, contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed. The arguments against replacing bit processors with floating point processors are also discussed.

  7. Electrostatic quadrupole array for focusing parallel beams of charged particles

    DOEpatents

    Brodowski, John (Smithtown, NY)

    1982-11-23

    An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

  8. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.

  9. Parallel array of independent thermostats for column separations

    DOEpatents

    Foret, Frantisek; Karger, Barry L.

    2005-08-16

    A thermostat array including an array of two or more capillary columns (10), or two or more channels in a microfabricated device, is disclosed. A heat-conductive material (12) surrounds each individual column or channel in the array, each individual column or channel being thermally insulated from every other individual column or channel. One or more independently controlled heating or cooling elements (14) are positioned adjacent to individual columns or channels within the heat-conductive material, each heating or cooling element being connected to a source of heating or cooling, and one or more independently controlled temperature-sensing elements (16) are positioned adjacent to the individual columns or channels within the heat-conductive material. Each temperature-sensing element is connected to a temperature controller.

  10. 1. Adaptive Self-Repairing Processor Array. U. S. Patent No. 4,591,980. 2. Detection of Motion in the Presence of Noise. U. S. Patent No. 4,835,732.

    E-print Network

    Huberman, Bernardo A.

    Patents 1. Adaptive Self-Repairing Processor Array. U. S. Patent No. 4,591,980. 2. Detection of Motion in the Presence of Noise. U. S. Patent No. 4,835,732. 3. Adaptive Processor Array Capable of Learning Variable Associations Useful in Recognizing Classes of Inputs. U. S. Patent No. 4,835,680. 4

  11. Fully parallel write/read in resistive synaptic array for accelerating on-chip learning

    NASA Astrophysics Data System (ADS)

    Gao, Ligang; Wang, I.-Ting; Chen, Pai-Yu; Vrudhula, Sarma; Seo, Jae-sun; Cao, Yu; Hou, Tuo-Hung; Yu, Shimeng

    2015-11-01

    A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging, one that generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaOx/TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states can be continuously tuned by identical programming pulses. In order to demonstrate the advantages of the parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into an array-level simulation, the proposed array architecture is able to achieve ~95% recognition accuracy on MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
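
    The array-wide update the write scheme accelerates is, in learning-algorithm terms, a rank-1 outer-product step applied to every weight at once. A numpy sketch of that step, with quantization to a finite number of conductance levels (the >200-level figure comes from the abstract; the update rule and all values are generic illustrations):

        import numpy as np

        def parallel_weight_update(W, x, delta, lr=0.1, levels=200):
            """Apply an outer-product weight update to the whole array in one
            step, then snap to discrete conductance levels."""
            W = np.clip(W + lr * np.outer(x, delta), 0.0, 1.0)
            return np.round(W * (levels - 1)) / (levels - 1)

        rng = np.random.default_rng(3)
        W = rng.uniform(0, 1, (128, 64))         # normalized conductances
        x = rng.uniform(0, 1, 128)               # pre-synaptic activity
        delta = 0.1 * rng.standard_normal(64)    # post-synaptic error term
        W = parallel_weight_update(W, x, delta)

    In a crossbar, the whole outer product is written physically in parallel, which is why the update time is independent of array size.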

  12. Array combination for parallel imaging in Magnetic Resonance Imaging 

    E-print Network

    Spence, Dan Kenrick

    2007-09-17

    In Magnetic Resonance Imaging, the time required to generate an image is proportional to the number of steps used to encode the spatial information. In rapid imaging, an array of coil elements and receivers are used to reduce the number of encoding...

  13. VLSI processor with a configurable processing element array for balanced feature extraction in high-resolution images

    NASA Astrophysics Data System (ADS)

    Zhu, Hongbo; Shibata, Tadashi

    2014-01-01

    A VLSI processor employing a configurable processing element array (PEA) is developed for a newly proposed balanced feature extraction algorithm. In the algorithm, the input image is divided into square regions and the number of features is determined by noise effect analysis in each region. Regions of different sizes are used according to the resolutions and contents of input images. Therefore, inside the PEA, processing elements are hierarchically grouped for feature extraction in regions of different sizes. A proof-of-concept chip is fabricated using a 0.18 µm CMOS technology with a 32 × 32 PEA. From measurement results, a speed of 7.5 kfps is achieved for feature extraction in 128 × 128 pixel regions when operating the chip at 45 MHz, and a speed of 55 fps is also achieved for feature extraction in 1920 × 1080 pixel images.

  14. A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system

    NASA Technical Reports Server (NTRS)

    Olariu, S.; Schwing, J.; Zhang, J.

    1991-01-01

    A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n / log m) time on a 2-D PARBS of size mn x n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 x n.
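
    For comparison, the standard sequential computation runs in O(n log n) time; Andrew's monotone chain is a compact reference implementation (this is the baseline the reconfigurable-mesh algorithm beats, not the PARBS algorithm itself):

        def convex_hull(points):
            """Andrew's monotone chain; returns hull vertices in CCW order."""
            pts = sorted(set(points))
            if len(pts) <= 2:
                return pts

            def cross(o, a, b):   # z-component of (a - o) x (b - o)
                return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

            lower, upper = [], []
            for p in pts:
                while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
                    lower.pop()
                lower.append(p)
            for p in reversed(pts):
                while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
                    upper.pop()
                upper.append(p)
            return lower[:-1] + upper[:-1]

        print(convex_hull([(0, 0), (1, 1), (2, 2), (2, 0), (0, 2), (1, 0)]))
        # -> [(0, 0), (2, 0), (2, 2), (0, 2)]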

  15. High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures

    NASA Technical Reports Server (NTRS)

    Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

    1985-01-01

    Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

  16. Computer Processor Allocator

    Energy Science and Technology Software Center (ESTSC)

    2004-03-01

    The Compute Processor Allocator (CPA) provides an efficient and reliable mechanism for managing and allotting processors in a massively parallel (MP) computer. It maintains information in a database on the health, configuration, and allocation of each processor. This persistent information is factored into each allocation decision. The CPA runs in a distributed fashion to avoid a single point of failure.

  17. Performance Models for Network Processor Design

    E-print Network

    Nagurney, Anna

    Performance Models for Network Processor Design. Tilman Wolf, Member, IEEE; Mark A. Franklin, Fellow, IEEE. ... An increasingly central component in router design is a chip-multiprocessor (CMP) referred to as a "network processor" or NP. In addition to multiple processors, NPs have multiple forms of on ...

  18. Trajectory optimization for real-time guidance. I - Time-varying LQR on a parallel processor

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.; Park, Kihong

    1990-01-01

    A key algorithmic element of a real-time trajectory optimization hardware/software implementation, the quadratic program (QP) solver element, is presented. The purpose of the effort is to make nonlinear trajectory optimization fast enough to provide real-time commands during guidance of a vehicle such as an aeromaneuvering orbiter. Many methods of nonlinear programming require the solution of a QP at each iteration. In the trajectory optimization case the QP has a special dynamic programming structure, an LQR-like structure. QP algorithm speed is increased by taking advantage of this special structure and by parallel implementation.
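
    The LQR-like structure referred to is the backward Riccati recursion, which solves the QP stage by stage instead of as one monolithic system. A numpy sketch for a discrete time-varying problem (generic textbook form, not the authors' parallel implementation):

        import numpy as np

        def tv_lqr(A, B, Q, R, QN):
            """Backward Riccati sweep: returns gains K[t] with u[t] = -K[t] x[t]
            for x[t+1] = A[t] x[t] + B[t] u[t] and stage costs x'Qx + u'Ru."""
            N = len(A)
            P, K = QN, [None] * N
            for t in reversed(range(N)):
                BtP = B[t].T @ P
                K[t] = np.linalg.solve(R[t] + BtP @ B[t], BtP @ A[t])
                Ac = A[t] - B[t] @ K[t]
                P = Q[t] + K[t].T @ R[t] @ K[t] + Ac.T @ P @ Ac
            return K

        dt = 0.1                                  # toy double-integrator horizon
        Ad = np.array([[1.0, dt], [0.0, 1.0]])
        Bd = np.array([[0.5 * dt**2], [dt]])
        N = 50
        K = tv_lqr([Ad]*N, [Bd]*N, [np.eye(2)]*N, [np.eye(1)]*N, 10 * np.eye(2))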

  19. Breast ultrasound tomography with two parallel transducer arrays: preliminary clinical results

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Intrator, Miranda; Hanson, Kenneth; Epstein, Katherine; Sandoval, Daniel; Williamson, Michael

    2015-03-01

    Ultrasound tomography has great potential to provide quantitative estimates of the physical properties of breast tumors for accurate characterization of breast cancer. We design and manufacture a new synthetic-aperture breast ultrasound tomography system with two parallel transducer arrays. The distance between the two transducer arrays is adjustable for scanning breasts of different sizes. The ultrasound transducer arrays are translated vertically to scan the entire breast slice by slice and acquire ultrasound transmission and reflection data for whole-breast ultrasound imaging and tomographic reconstructions. We use the system to acquire patient data at the University of New Mexico Hospital for clinical studies. We present preliminary imaging results from in vivo patient ultrasound data. These preliminary clinical results show the promise of our breast ultrasound tomography system with two parallel transducer arrays for breast cancer imaging and characterization.

  20. Parallel RNA extraction using magnetic beads and a droplet array

    PubMed Central

    Shi, Xu; Chen, Chun-Hong; Gao, Weimin; Meldrum, Deirdre R.

    2015-01-01

    Nucleic acid extraction is a necessary step for most genomic/transcriptomic analyses, but it often requires complicated mechanisms to be integrated into a lab-on-a-chip device. Here, we present a simple, effective configuration for rapidly obtaining purified RNA from low concentration cell medium. This Total RNA Extraction Droplet Array (TREDA) utilizes an array of surface-adhering droplets to facilitate the transportation of magnetic purification beads seamlessly through individual buffer solutions without solid structures. The fabrication of TREDA chips is rapid and does not require a microfabrication facility or expertise. The process takes less than 5 minutes. When purifying mRNA from bulk marine diatom samples, its repeatability and extraction efficiency are comparable to conventional tube-based operations. We demonstrate that TREDA can extract the total mRNA of about 10 marine diatom cells, indicating that the sensitivity of TREDA approaches single-digit cell numbers. PMID:25519439

  1. Ionic liquid as a suitable phase for multistep parallel synthesis of an array of isoxazolines.

    PubMed

    Rodriquez, Manuela; Sega, Alessandro; Taddei, Maurizio

    2003-10-30

    A parallel array of isoxazoline diamides was prepared using an ionic liquid, [bmim][BF4], as the phase in which a three-step procedure (Schotten-Baumann acylation, 1,3-dipolar cycloaddition, ester amidation with Me3Al) was carried out. At the end, selective extraction of the final products with diethyl ether allowed simple isolation of the 16 components of the array (Syncore technology). PMID:14572241

  2. Method of up-front load balancing for local memory parallel processors

    NASA Technical Reports Server (NTRS)

    Baffes, Paul Thomas (inventor)

    1990-01-01

    In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. The plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balanced load. The merger is based upon the value of a partition threshold, which is a measure of memory utilization. The turnaround time and memory savings of the method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of sixty to seventy-five percent.
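
    A toy sketch of the up-front idea: pre-partition the work into more process sets than processors, then repeatedly merge the two lightest sets until one set per processor remains. This greedy merge is an illustration of the described strategy, not the patented algorithm; all loads are invented.

        import heapq

        def merge_process_sets(loads, n_processors):
            """Greedily merge the two lightest process sets until the number
            of sets equals the number of processors."""
            heap = [(load, [i]) for i, load in enumerate(loads)]
            heapq.heapify(heap)
            while len(heap) > n_processors:
                l1, m1 = heapq.heappop(heap)
                l2, m2 = heapq.heappop(heap)
                heapq.heappush(heap, (l1 + l2, m1 + m2))
            return sorted(heap, reverse=True)

        # Twelve uneven process sets merged down to four processing units.
        for load, members in merge_process_sets([5, 9, 3, 7, 2, 8, 4, 6, 1, 5, 2, 3], 4):
            print(f"load={load:2d}  process sets={members}")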

  3. AN ANALOGUE SIMD FOCAL-PLANE PROCESSOR ARRAY Piotr Dudek and Peter J. Hicks

    E-print Network

    Dudek, Piotr

    ... feature is a mesh-connected array of analogue processing elements (APEs). Each APE ... on analogue samples of data, yet the APEs work in a software-programmable SIMD fashion. They execute a sequence of instructions issued by an external digital controller. The APEs support a fairly conventional ...

  4. Accuracy and Efficiency of Grey-level Image Filtering on VLSI Cellular Processor Arrays

    E-print Network

    Dudek, Piotr

    ... grey-level image processing operations - linear convolutions with 3×3 kernels - in terms of speed, accuracy, and power consumption. ... The cellular neural networks (CNNs) were proposed not only as a paradigm for complexity, but also ... where the convolution operation is executed in parallel. ... It is often ...
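
    The operation being benchmarked is easy to state in a few lines; a numpy sketch of a direct 3×3 convolution (correlation-style, no kernel flip), which a cellular array evaluates simultaneously in every processing element:

        import numpy as np

        def convolve3x3(image, kernel):
            """Direct 3x3 neighbourhood filtering with zero padding; each
            output pixel is a weighted sum of its 3x3 neighbourhood."""
            H, W = image.shape
            padded = np.pad(image, 1)
            out = np.zeros_like(image, dtype=float)
            for dr in range(3):
                for dc in range(3):
                    out += kernel[dr, dc] * padded[dr:dr+H, dc:dc+W]
            return out

        img = np.arange(25, dtype=float).reshape(5, 5)
        laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
        print(convolve3x3(img, laplacian))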

  5. Massively parallel computation of lattice associative memory classifiers on multicore processors

    NASA Astrophysics Data System (ADS)

    Ritter, Gerhard X.; Schmalz, Mark S.; Hayden, Eric T.

    2011-09-01

    Over the past quarter century, concepts and theory derived from neural networks (NNs) have featured prominently in the literature of pattern recognition. Implementationally, classical NNs based on the linear inner product can present performance challenges due to the use of multiplication operations. In contrast, NNs having nonlinear kernels based on Lattice Associative Memories (LAM) theory tend to concentrate primarily on addition and maximum/minimum operations. More generally, the emergence of LAM-based NNs, with their superior information storage capacity, fast convergence and training due to relatively lower computational cost, as well as noise-tolerant classification has extended the capabilities of neural networks far beyond the limited applications potential of classical NNs. This paper explores theory and algorithmic approaches for the efficient computation of LAM-based neural networks, in particular lattice neural nets and dendritic lattice associative memories. Of particular interest are massively parallel architectures such as multicore CPUs and graphics processing units (GPUs). Originally developed for video gaming applications, GPUs hold the promise of high computational throughput without compromising numerical accuracy. Unfortunately, currently-available GPU architectures tend to have idiosyncratic memory hierarchies that can produce unacceptably high data movement latencies for relatively simple operations, unless careful design of theory and algorithms is employed. Advantageously, some GPUs (e.g., the Nvidia Fermi GPU) are optimized for efficient streaming computation (e.g., concurrent multiply and add operations). As a result, the linear or nonlinear inner product structures of NNs are inherently suited to multicore GPU computational capabilities. In this paper, the authors' recent research in lattice associative memories and their implementation on multicores is overviewed, with results that show utility for a wide variety of pattern classification applications using classical NNs or lattice-based NNs. Dataflow diagrams are presented in terms of a parameterized model of data burden and LAM partitioning.
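
    The multiply-free recall that the paragraph highlights can be shown in a few lines. In lattice associative memory notation, storage uses a min construction and recall uses max-plus; a numpy sketch of the autoassociative case (toy patterns, not from the paper):

        import numpy as np

        def lam_store(X, Y):
            """Min-memory W_XY: W[i, j] = min over patterns k of (Y[k,i] - X[k,j])."""
            return np.min(Y[:, :, None] - X[:, None, :], axis=0)

        def lam_recall(W, x):
            """Max-plus recall: y[i] = max over j of (W[i,j] + x[j]).
            Only additions and maxima -- no multiplications."""
            return np.max(W + x[None, :], axis=1)

        X = np.array([[1., 2., 3.], [0., 1., 0.]])   # stored patterns
        W = lam_store(X, X)                          # autoassociative storage
        print(lam_recall(W, X[0]))                   # -> [1. 2. 3.] (perfect recall)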

  6. Parallel transport of biological cells using individually addressable VCSEL arrays as optical tweezers

    E-print Network

    Esener, Sadik C.

    Parallel transport of biological cells using individually addressable VCSEL arrays as optical tweezers ... vertical-cavity surface-emitting lasers (VCSELs) for optical trapping and active manipulation of live biological cells and microspheres. We have experimentally verified that the Laguerre-Gaussian laser mode output from the VCSEL functions just as well ...

  7. Array Dataflow Analysis for Explicitly Parallel Programs

    E-print Network

    Passau, Universität

    Array Dataflow Analysis for Explicitly Parallel Programs. Jean-François Collard (Versailles, France, Jean-Francois.Collard@prism.uvsq.fr) and Martin Griebl (FMI, Universität Passau, Innstraße ...)

  8. Comparison of Three Transmit Arrays for Parallel Transmit

    E-print Network

    Goyal, Vivek K

    Comparison of Three Transmit Arrays for Parallel Transmit. V. Alagappan, E. Adalsteinsson, K. ..., ... Solutions, Charlestown, MA, United States. ... in the homogeneous excitation (RF shimming) experiment. The relative transmit efficiency was measured in the center ... Introduction: The inhomogeneous transmit B1 pattern in conventional ...

  9. Parallel recognition of cancer cells using an addressable array of solid-state micropores

    E-print Network

    Texas at Arlington, University of

    Parallel recognition of cancer cells using an addressable array of solid-state micropores. Azhar ... Keywords: pulse signal; leukocytes; detection efficiency; single-cell measurement; single-cell analysis. Abstract: Early-stage detection and precise quantification of circulating tumor cells (CTCs) ...

  10. Achieving supercomputer performance for neural net simulation with an array of digital signal processors

    SciTech Connect

    Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W.

    1992-10-01

    Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

  11. Development of a ground signal processor for digital synthetic array radar data

    NASA Technical Reports Server (NTRS)

    Griffin, C. R.; Estes, J. M.

    1981-01-01

    A modified APQ-102 sidelooking array radar (SLAR) in a B-57 aircraft test bed is used, with other optical and infrared sensors, in remote sensing of Earth surface features for various users at NASA Johnson Space Center. The video from the radar is normally recorded on photographic film and subsequently processed photographically into high resolution radar images. Using a high speed sampling (digitizing) system, the two receiver channels of cross- and co-polarized video are recorded on wideband magnetic tape along with radar and platform parameters. These data are subsequently reformatted and processed into digital synthetic aperture radar images, with the image data available on magnetic tape for subsequent analysis by investigators. The system design and results obtained are described.

  12. Development of a ground signal processor for digital synthetic array radar data

    NASA Astrophysics Data System (ADS)

    Griffin, C. R.; Estes, J. M.

    1981-05-01

    A modified APQ-102 sidelooking array radar (SLAR) in a B-57 aircraft test bed is used, with other optical and infrared sensors, in remote sensing of Earth surface features for various users at NASA Johnson Space Center. The video from the radar is normally recorded on photographic film and subsequently processed photographically into high resolution radar images. Using a high speed sampling (digitizing) system, the two receiver channels of cross- and co-polarized video are recorded on wideband magnetic tape along with radar and platform parameters. These data are subsequently reformatted and processed into digital synthetic aperture radar images, with the image data available on magnetic tape for subsequent analysis by investigators. The system design and results obtained are described.

  13. Using a Cray Y-MP as an array processor for a RISC Workstation

    NASA Technical Reports Server (NTRS)

    Lamaster, Hugh; Rogallo, Sarah J.

    1992-01-01

    As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980s, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate for a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation with a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment is described which demonstrates that matrix multiplication can be executed remotely on a large system faster than it can be executed locally on a workstation.
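
    The amortization argument generalizes: offload only when the arithmetic outweighs the call overhead. A hedged Python sketch (the server URL and its matmul method are invented placeholders, not an API from the paper; any RPC layer would do):

    ```python
    import numpy as np
    import xmlrpc.client  # stdlib RPC client, standing in for the paper's RPC layer

    # Hypothetical matrix-multiply service running on the large remote system.
    REMOTE = xmlrpc.client.ServerProxy("http://bigiron.example.org:8080")

    def matmul(a, b, rpc_overhead_flops=5e6):
        """Run A @ B locally unless the operation count amortizes the RPC cost."""
        n, k = a.shape
        m = b.shape[1]
        flops = 2.0 * n * k * m          # multiply-adds in a dense product
        if flops < rpc_overhead_flops:   # too small: call latency dominates
            return a @ b
        # XML-RPC has no native array type, so ship operands as nested lists.
        return np.array(REMOTE.matmul(a.tolist(), b.tolist()))
    ```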

  14. Mitigation of cache memory using an embedded hard-core PPC440 processor in a Virtex-5 Field Programmable Gate Array.

    SciTech Connect

    Learn, Mark Walter

    2010-02-01

    Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.

  15. Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit

    SciTech Connect

    Daily, Jeffrey A.; Lewis, Robert R.

    2011-11-30

    Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
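
    The appeal of a drop-in replacement is that code written purely in array syntax needs no restructuring. The serial NumPy kernel below is the kind of code targeted; per the abstract, swapping the import for the GAiN module (exact module path not shown here, as it would be an assumption) is the scale of change required.

    ```python
    import numpy as np  # per the abstract, a GAiN import would replace this line

    def jacobi_step(u):
        """One Jacobi relaxation sweep written entirely in array syntax,
        so a distributed drop-in can partition the work across processes."""
        v = u.copy()
        v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                u[1:-1, :-2] + u[1:-1, 2:])
        return v

    u = np.zeros((512, 512))
    u[0, :] = 1.0                  # hot boundary row
    for _ in range(100):
        u = jacobi_step(u)
    ```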

  16. Cross-link induced linear and curved polymer channel waveguide arrays for massively parallel optical interconnects

    NASA Astrophysics Data System (ADS)

    Chen, Ray T.

    1993-01-01

    A single-mode polymer-based channel waveguide array with a packaging density of 1250 channels/cm on a cross-link induced photopolymeric thin film is reported. This array operates at 1.31 and 0.63 micrometers. Curved waveguides with radii of curvature (ROC) from 1 mm to 40 mm were demonstrated. Waveguide propagation loss in the neighborhood of 0.1 dB/cm was demonstrated for both linear and curved waveguides. Interconnectivity for various interconnection architectures, including crossbar, hypercube, daisy chain, and star, is further considered. Multiple layers of optical interconnects may be required for an optical backplane involving massively parallel, highly distributed computing systems.

  17. Photon detection with parallel asynchronous processing

    NASA Technical Reports Server (NTRS)

    Coon, D. D.; Perera, A. G. U.

    1990-01-01

    An approach to photon detection with a parallel asynchronous signal processor is described. The visible or IR photon-detection capability of the silicon p(+)-n-n(+) detectors and the parallel asynchronous processing are addressed separately. This approach would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the devices would form a 2D array processor with a 2D array of inputs located directly behind a focal-plane detector array. A 2D image data stream would propagate in neuronlike asynchronous pulse-coded form through the laminar processor. Such systems can integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The possibility of multispectral image processing is addressed.
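
    The neuronlike pulse-coded form can be illustrated with a simple integrate-and-fire encoder: brighter pixels accumulate charge faster and therefore fire more often. This is a generic discrete-time sketch of pulse-rate coding, not the authors' analog device physics; the threshold and step count are arbitrary.

    ```python
    import numpy as np

    def pulse_code(image, threshold=1.0, steps=100):
        """Encode pixel intensity as per-pixel pulse trains (integrate-and-fire)."""
        v = np.zeros_like(image, dtype=float)   # integrated 'charge' per pixel
        pulses = []
        for _ in range(steps):
            v += image                  # photocurrent integration
            fired = v >= threshold      # pixels crossing threshold emit a pulse
            pulses.append(fired.copy())
            v[fired] -= threshold       # reset by subtraction after each pulse
        return pulses                   # pulse rate is proportional to intensity
    ```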

  18. The effect of steroids on peripheral blood lymphocytes containing parallel tubular arrays.

    PubMed Central

    Payne, C. M.; Glasser, L.

    1978-01-01

    The response of lymphocytes containing cytoplasmic inclusions called parallel tubular arrays (PTA) was determined after the administration of the glucocorticoid dexamethasone to 10 healthy volunteers. The percentage of these lymphocytes was found to increase during the lymphopenia induced by steroid administration. The size and number of parallel tubular arrays per cell showed no differences before and after steroid administration, indicating that the increase was a result of a change in the proportion of whole cells. This indicates, for the first time, that a morphologically defined population of lymphocytes from the normal peripheral circulation has been linked to a specific response, i.e., steroid resistance. The possible mechanism of steroid resistance is discussed. PMID:686151

  19. CombinePlt and CombineThs user manual: Merging multiple, processor-local plot and time-history data bases produced during a parallel calculation. Revision 1

    SciTech Connect

    Procassini, R.J.; DeGroot, A.J.

    1995-09-21

    The CombinePlt and CombineThs post-processing utilities are designed to merge the data in multiple, processor-local plot and time-history data bases produced by the parallel versions of the analysis codes DYNA3D, NIKE3D or PING into a serial data base which is compatible with the existing versions of the GRIZ and THUG visualization tools. These utilities make use of the partition assignment file produced by the PartMesh suite of pre-processing utilities to map the data from the processor-local order to global order. These utilities are also capable of translating 64-bit IEEE data bases into 32-bit IEEE data bases which are required for post-processing with GRIZ or THUG on an SGI workstation.

  20. CombinePlt and CombineThs user manual: Merging multiple, processor-local plot and time-history data bases produced during a parallel calculation

    SciTech Connect

    Procassini, R.J.; DeGroot, A.J.

    1995-06-01

    The CombinePlt and CombineThs post-processing utilities are designed to merge the data in multiple, processor-local plot and time-history data bases produced by the parallel versions of the analysis codes DYNA3D, NIKE3D or PING into a serial data base which is compatible with the existing versions of the GRIZ and THUG visualization tools. These utilities make use of the partition assignment file produced by the PartMesh suite of pre-processing utilities to map the data from the processor-local order to global order. These utilities are also capable of translating 64-bit IEEE data bases into 32-bit IEEE data bases which are required for post-processing with GRIZ or THUG on an SGI workstation.

  1. Sequence information signal processor

    DOEpatents

    Peterson, John C. (Alta Loma, CA); Chow, Edward T. (San Dimas, CA); Waterman, Michael S. (Culver City, CA); Hunkapillar, Timothy J. (Pasadena, CA)

    1999-01-01

    An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
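
    The scoring parameter each cell propagates resembles a Smith-Waterman-style local alignment recurrence. The sketch below emulates the linear array serially in Python; the scoring values are arbitrary, and the patent's exact scoring function may differ.

    ```python
    def best_local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
        """Cell j stores b[j]; elements of `a` stream past one per 'clock tick'.
        Each cell keeps the score of the best segment ending at (a[i], b[j])."""
        prev = [0] * (len(b) + 1)   # scores from the previous streamed element
        best = 0
        for ai in a:
            cur = [0]
            for j, bj in enumerate(b, 1):
                s = match if ai == bj else mismatch
                h = max(0,
                        prev[j - 1] + s,    # extend the segment diagonally
                        prev[j] + gap,      # gap in b
                        cur[j - 1] + gap)   # gap in a (left neighbor's output)
                cur.append(h)
                best = max(best, h)
            prev = cur
        return best

    print(best_local_alignment_score("ACGT", "ACGT"))  # -> 8 (four matches)
    ```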

  2. Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation.

    PubMed

    Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan

    2015-08-01

    Optical nanopore sensing offers great potential in single-molecule detection, genotyping, and DNA sequencing for high-throughput applications. However, one of the bottlenecks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores that has a large pore-to-pore distance, small variation in pore size, and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bp) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching, resulting in pore diameters down to ~7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively-parallel single-molecule detection. PMID:26180050

  3. Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation

    NASA Astrophysics Data System (ADS)

    Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan

    2015-08-01

    Optical nanopore sensing offers great potential in single-molecule detection, genotyping, and DNA sequencing for high-throughput applications. However, one of the bottlenecks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores that has a large pore-to-pore distance, small variation in pore size, and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bp) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching, resulting in pore diameters down to ~7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively-parallel single-molecule detection.

  4. Parallel and series fed microstrip array with high efficiency and low cross polarization

    NASA Technical Reports Server (NTRS)

    Huang, John (inventor)

    1995-01-01

    A microstrip array antenna providing a vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications, with a physical area of 1.7 m by 0.17 m, comprises two rows of patch elements and employs a parallel feed to the left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel, with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, are terminated with half-wavelength-long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

  5. Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

    SciTech Connect

    Werner, N.E.; Van Matre, S.W.

    1985-05-01

    This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.

  6. Two-Dimensional Systolic Array For Kalman-Filter Computing

    NASA Technical Reports Server (NTRS)

    Chang, Jaw John; Yeh, Hen-Geul

    1988-01-01

    Two-dimensional, systolic-array, parallel data processor performs Kalman filtering in real time. Algorithm rearranged to be Faddeev algorithm for generalized signal processing. Algorithm mapped onto very-large-scale integrated-circuit (VLSI) chip in two-dimensional, regular, simple, expandable array of concurrent processing cells. Processor does matrix/vector-based algebraic computations. Applications include adaptive control of robots, remote manipulators and flexible structures and processing radar signals to track targets.
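
    The Faddeev algorithm maps so regularly onto a mesh because it is just row elimination on a block array: eliminating the lower-left block of [[A, B], [-C, D]] leaves D + C·A⁻¹·B in the lower-right. A minimal serial NumPy sketch (no pivoting; assumes nonzero pivots), not the VLSI cell design itself:

    ```python
    import numpy as np

    def faddeev(A, B, C, D):
        """Return D + C @ inv(A) @ B via Gaussian elimination on [[A, B], [-C, D]].
        Block choices specialize it: C = I, D = 0 gives inv(A) @ B;
        A = I gives D + C @ B, etc. -- one array covers many kernels."""
        n = A.shape[0]
        M = np.block([[A, B], [-C, D]]).astype(float)
        for k in range(n):                      # sweep over A's columns
            for i in range(k + 1, M.shape[0]):  # zero entries below the pivot
                f = M[i, k] / M[k, k]           # (a systolic array pipelines this)
                M[i, k:] -= f * M[k, k:]
        return M[n:, n:]

    A = np.array([[2.0]]); B = np.array([[4.0]])
    C = np.array([[3.0]]); D = np.array([[5.0]])
    print(faddeev(A, B, C, D))  # [[11.]] = 5 + 3 * (1/2) * 4
    ```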

  7. Dynamic scheduling and planning parallel observations on large Radio Telescope Arrays with the Square Kilometre Array in mind

    NASA Astrophysics Data System (ADS)

    Buchner, Johannes

    2011-12-01

    Scheduling, the task of producing a time table for resources and tasks, is well known to become more difficult the more resources are involved (it is an NP-hard problem). This is about to become an issue in radio astronomy, as observatories consisting of hundreds to thousands of telescopes are planned and operated. The Square Kilometre Array (SKA), which Australia and New Zealand bid to host, is aiming for scales where current approaches -- in construction and operation, but also in scheduling -- are insufficient. Although manual scheduling is common today, the problem is complicated by the demand for (1) independent sub-arrays doing simultaneous observations, which requires the scheduler to plan parallel observations, and (2) dynamic re-scheduling on changed conditions. Both of these requirements apply to the SKA, especially in the construction phase. We review the scheduling approaches taken in the astronomy literature and investigate techniques used by human schedulers and at today's observatories. The scheduling problem is specified in general for scientific observations and in particular for radio telescope arrays. Also taken into account is the fact that the observatory may be oversubscribed, requiring the scheduling problem to be integrated with a planning process. We solve this long-term scheduling problem using a time-based encoding that works in the very general case of observation scheduling. This research then compares algorithms from various approaches, including fast heuristics from CPU scheduling, linear integer programming, genetic algorithms, and branch-and-bound enumeration schemes. Measures include not only goodness of the solution, but also scalability and re-scheduling capabilities. In conclusion, we have identified a fast and effective scheduling approach that allows (re-)scheduling of difficult and changing problems by combining heuristics with a genetic algorithm using block-wise mutation operations. We are able to explain and eliminate two problems reported in the literature: the inability of a GA to properly improve schedules, and the generation of schedules with frequent interruptions. Finally, we demonstrate the scheduling framework for several operating telescopes: (1) dynamic re-scheduling with the AUT Warkworth 12 m telescope, (2) scheduling for the Australian Mopra 22 m telescope, and (3) scheduling for the Allen Telescope Array. Furthermore, we discuss the applicability of the presented scheduling framework to the Atacama Large Millimeter/submillimeter Array (ALMA, under construction) and the SKA. In particular, during the development phase of the SKA, this dynamic, scalable scheduling framework can accommodate changing conditions.
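
    As a sketch of the block-wise mutation idea from the conclusion (the schedule encoding, fitness, and GA parameters below are invented for illustration; the thesis uses a richer time-based encoding):

    ```python
    import random

    def block_mutate(schedule, max_block=5):
        """Move a contiguous run of observations to a new position, keeping the
        run intact -- avoiding the fragmented, interruption-heavy schedules
        that single-point mutation tends to produce."""
        s = list(schedule)
        size = random.randint(1, min(max_block, len(s)))
        start = random.randrange(len(s) - size + 1)
        block = s[start:start + size]
        del s[start:start + size]
        dest = random.randrange(len(s) + 1)
        return s[:dest] + block + s[dest:]

    def evolve(population, fitness, generations=200, elite=2):
        """Tiny mutation-only GA loop (crossover omitted for brevity)."""
        for _ in range(generations):
            population.sort(key=fitness, reverse=True)
            parents = population[:max(elite, len(population) // 2)]
            population = parents[:elite] + [
                block_mutate(random.choice(parents))
                for _ in range(len(population) - elite)
            ]
        return max(population, key=fitness)
    ```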

  8. Nanopore arrays in a silicon membrane for parallel single-molecule detection: fabrication

    NASA Astrophysics Data System (ADS)

    Schmidt, Torsten; Zhang, Miao; Sychugov, Ilya; Roxhed, Niclas; Linnros, Jan

    2015-08-01

    Solid state nanopores enable translocation and detection of single bio-molecules such as DNA in buffer solutions. Here, sub-10 nm nanopore arrays in silicon membranes were fabricated by using electron-beam lithography to define etch pits, followed by an electrochemical etching step. This approach effectively decouples the positioning of the pores from the control of their size, where the pore size essentially results from the anodizing current and time in the etching cell. Nanopores with diameters as small as 7 nm, fully penetrating 300 nm thick membranes, were obtained. The presented fabrication scheme for forming large arrays of nanopores is attractive for parallel bio-molecule sensing and DNA sequencing using optical techniques. In particular, the signal-to-noise ratio is improved compared to alternatives such as nitride membranes, which suffer from a high luminescence background.

  9. Control scheme for microcomputers being used in multiprocessor arrays

    SciTech Connect

    Meng, J.; Gin, F.

    1984-06-01

    In general, microcomputer central processor devices are completely controllable from memory and memory control lines. By interjecting a controlling processor between the central processor chip and its memory, and using the central processor memory ready signal for synchronization, data can be supplied to the microprocessor either from an attached memory or from the controlling processor. The controlling processor may also download codes into the microprocessor's memory to be used either as programs or as data. By manipulating restart, hold and interrupt signal lines in addition to the memory lines, total control is achieved. Such a scheme can be used to orchestrate the simultaneous application of arrays of microcomputers to single large problems or to many discrete smaller problems. We describe the details of such connections to three commercially available devices: a Motorola 68000, an Advanced Micro Devices 29116 and a National Semiconductor NS32032 and indicate how our scheme may be used to connect such devices into a cooperating parallel array.

  10. Parallel computing methods for x-ray cone beam tomography with large array sizes

    SciTech Connect

    Reimann, D.A.; Flynn, M.J.; Sethi, I.K.

    1996-12-31

    Cone beam geometries are increasingly of interest for x-ray CT applications to improve imaging efficiency. In this paper, we describe our practical experience implementing circular orbit cone beam backprojection on workstation clusters. The reconstruction problem is computationally intensive, particularly for arrays of 512 voxels in each direction. A voxel driven approach is described where the reconstruction volume is partitioned into variable width slabs and each slab is given to a workstation. Each projection is filtered by one workstation and then sent to the others for backprojection. While most computation is done in the backprojection step, a significant amount of time must be spent in sending projection data. A method is detailed to further reduce the communication overhead by restricting the amount of projection data sent to only what is required by each backprojecting workstation. Furthermore, if the shape of the backprojection slabs is made as square as possible, the total communication requirement can be minimized. By reducing the communication requirement, an overall improvement in processor utilization was observed, and the crossover point where communication dominates was improved.
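
    The communication saving comes from a geometric bound: a slab of the volume can only ever be hit by a limited band of detector rows, so only that band of each filtered projection needs to be sent. A hedged sketch of the bound (variable names, detector centering, and the conservative worst-case magnification are assumptions, not the paper's formulation):

    ```python
    def slab_detector_rows(z0, z1, n_rows, row_height, src_dist, det_dist, fov_radius):
        """Conservative detector-row band touched by backprojecting the slab
        z in [z0, z1] over a full circular orbit."""
        # A voxel at height z projects to detector height z * magnification;
        # magnification is largest for voxels nearest the source and smallest
        # for voxels farthest away, so both extremes bound the band.
        mags = ((src_dist + det_dist) / (src_dist - fov_radius),
                (src_dist + det_dist) / (src_dist + fov_radius))
        heights = [z * m for z in (z0, z1) for m in mags]
        to_row = lambda h: h / row_height + n_rows / 2.0   # centered detector
        lo = int(min(to_row(h) for h in heights)) - 1      # pad one row
        hi = int(max(to_row(h) for h in heights)) + 2
        return max(0, lo), min(n_rows, hi)
    ```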

  11. Performance of the UCAN2 Gyrokinetic Particle In Cell (PIC) Code on Two Massively Parallel Mainframes with Intel ``Sandy Bridge'' Processors

    NASA Astrophysics Data System (ADS)

    Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul

    2013-10-01

    The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers -- one at NERSC in the US, built by Cray and named Edison, and the other at the Barcelona Supercomputer Center (BSC) in Spain, built by IBM and named MareNostrum III (MNIII) -- happen to share the same Intel ``Sandy Bridge'' processors. The successful port of UCAN2 to MNIII, which came online first, has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.

  12. High Density Single-Molecule-Bead Arrays for Parallel Single Molecule Force Spectroscopy

    PubMed Central

    Barrett, Michael J.; Oliver, Piercen M.; Cheng, Peng; Cetin, Deniz; Vezenov, Dmitri

    2012-01-01

    The assembly of a highly parallel force spectroscopy tool requires careful placement of single-molecule targets on the substrate and the deliberate manipulation of a multitude of force probes. Since the probe must approach the target biomolecule for covalent attachment while avoiding irreversible adhesion to the substrate, the use of polymer microspheres as force probes to create the tethered bead array poses a problem. Therefore, the interactions between the force probe and the surface must be repulsive at very short distances (< 5 nm) and attractive at long distances. To achieve this balance, the chemistry of the substrate, force probe, and solution must be tailored to control the probe-surface interactions. In addition to an appropriately designed chemistry, it is necessary to control the surface density of the target molecule in order to ensure that only one molecule is interrogated by a single force probe. We used gold-thiol chemistry to control both the substrate's surface chemistry and the spacing of the studied molecules, through competitive binding of the thiol-terminated DNA and an inert thiol forming a blocking layer. For our single molecule array, we modeled the forces between the probe and the substrate using DLVO theory and measured their magnitude and direction with colloidal probe microscopy. The practicality of each system was tested using a probe binding assay to evaluate the proportion of the beads remaining adhered to the surface after application of force. We have translated the results specific to our system into general guiding principles for the preparation of tethered bead arrays and demonstrated the ability of this system to produce a high yield of active force spectroscopy probes in a microwell substrate. This study outlines the characteristics of the chemistry needed to create such a force spectroscopy array. PMID:22548234

  13. A micromachined silicon parallel acoustic delay line (PADL) array for real-time photoacoustic tomography (PAT)

    NASA Astrophysics Data System (ADS)

    Cho, Young Y.; Chang, Cheng-Chung; Wang, Lihong V.; Zou, Jun

    2015-03-01

    To achieve real-time photoacoustic tomography (PAT), massive transducer arrays and data acquisition (DAQ) electronics are needed to receive the PA signals simultaneously, which results in complex and high-cost ultrasound receiver systems. To address this issue, we have developed a new PA data acquisition approach using acoustic time delay. Optical fibers were used as parallel acoustic delay lines (PADLs) to create different time delays in multiple channels of PA signals. This makes the PA signals reach a single-element transducer at different times. As a result, they can be properly received by single-channel DAQ electronics. However, due to their small diameter and fragility, using optical fiber as acoustic delay lines poses a number of challenges in the design, construction and packaging of the PADLs, thereby limiting their performances and use in real imaging applications. In this paper, we report the development of new silicon PADLs, which are directly made from silicon wafers using advanced micromachining technologies. The silicon PADLs have very low acoustic attenuation and distortion. A linear array of 16 silicon PADLs were assembled into a handheld package with one common input port and one common output port. To demonstrate its real-time PAT capability, the silicon PADL array (with its output port interfaced with a single-element transducer) was used to receive 16 channels of PA signals simultaneously from a tissue-mimicking optical phantom sample. The reconstructed PA image matches well with the imaging target. Therefore, the silicon PADL array can provide a 16× reduction in the ultrasound DAQ channels for real-time PAT.

  14. Optical signal processing of phased array radar

    NASA Astrophysics Data System (ADS)

    Weverka, Robert T.

    This thesis develops optical processors that scale to very high processing speed. Optical signal processing is often promoted on the basis of smaller size, lower weight, and lower power consumption as well as higher signal processing speed. While each of these requirements has applications, it is the ones that require processing speed beyond that available in electronics that are most compelling. Thirty years ago, optical processing was the only method fast enough to process Synthetic Aperture Radar (SAR), one of the more demanding signal processing tasks at the time. Since then, electronic processing speed has improved sufficiently to tackle that problem. We have sought out the problems that require significantly higher processing speed and developed optical processors that tackle these more difficult problems. The components that contribute to high signal processing speed are high input signal bandwidth, a large number of parallel input channels each with this high bandwidth, and a large number of parallel operations required on each input channel. Adaptive signal processing for phased array radar has all of these factors. The processors developed for this task scale well in three dimensions, which allows them to maximize parallelism for high speed. This thesis explores an example of a negative feedback adaptive phased array processor and an example of a positive feedback phased array processor. The negative feedback processor uses an array of inputs in up to two dimensions, together with the time history of the signal in the third dimension, to adapt the array pattern to null out incoming jammer signals. The positive feedback processor uses the incoming signals and assumptions about the radar scene to correct for position errors in a phased array. Discovery and analysis of these new processors are facilitated by an original volume holographic analysis technique developed in the thesis. The thesis includes a new acoustooptic Bragg cell geometry developed with this analysis technique. This Bragg cell provides a low insertion delay, making it suitable for the feedback phased array radar systems. The thesis develops a new algorithm for phased array radar processing. This adaptation of the Widrow algorithm requires fewer delay lines, allowing us to implement a system that can scale to dense two-dimensional phased array radar. The thesis explores this processor in depth, developing the description of the system evolution, the nonlinear dynamics governing the system, and the dynamic range that can be achieved. The system behavior and dynamics are confirmed experimentally. Finally, the thesis explores positive feedback architectures for the phased array radar problem posed by Steinberg, in which the array itself is poorly surveyed. To our knowledge, optical signal processing solutions to this problem have not been developed prior to this work.

  15. Computation and parallel implementation for early vision

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    1990-01-01

    The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) into image representations built out of such primitive visual features as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks, including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

  16. Modeling of the phase lag causing fluidelastic instability in a parallel triangular tube array

    NASA Astrophysics Data System (ADS)

    Khalifa, Ahmed; Weaver, David; Ziada, Samir

    2013-11-01

    Fluidelastic instability is considered a critical flow induced vibration mechanism in tube and shell heat exchangers. It is believed that a finite time lag between tube vibration and fluid response is essential to predict the phenomenon. However, the physical nature of this time lag is not fully understood. This paper presents a fundamental study of this time delay using a parallel triangular tube array with a pitch ratio of 1.54. A computational fluid dynamics (CFD) model was developed and validated experimentally in an attempt to investigate the interaction between tube vibrations and flow perturbations at lower reduced velocities Ur = 1-6 and Reynolds numbers Re = 2000-12000. The numerical predictions of the phase lag are in reasonable agreement with the experimental measurements for the range of reduced velocities Ur = Ug/(fd) = 6-7. It was found that there are two propagation mechanisms: the first, associated with acoustic wave propagation at low reduced velocities, Ur < 2, and the second, for higher reduced velocities, associated with vorticity shedding and convection. An empirical model of the two mechanisms is developed, and the phase lag predictions are in reasonable agreement with the experimental and numerical measurements. The developed phase lag model is then coupled with the semi-analytical model of Lever and Weaver to predict the fluidelastic stability threshold. Improved predictions of the stability boundaries for the parallel triangular array were achieved. In addition, the present study explains why fluidelastic instability does not occur below some threshold reduced velocity.
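
    For reference, the nondimensional quantities above can be written out explicitly; the phase-lag expression below is a generic convection-delay estimate consistent with the abstract's two mechanisms, not the authors' final empirical model (symbols assumed: gap velocity U_g, tube natural frequency f, tube diameter d):

    ```latex
    % Reduced velocity:
    U_r = \frac{U_g}{f\,d}
    % A perturbation convected over a distance s at speed c (the sound speed for
    % the low-U_r acoustic mechanism, a fraction of U_g for the vorticity-
    % convection mechanism) arrives after \tau = s/c, giving a phase lag
    \phi = 2\pi f \tau = \frac{2\pi f s}{c}
    ```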

  17. Comparison of 3-D synthetic aperture phased-array ultrasound imaging and parallel beamforming.

    PubMed

    Rasmussen, Morten Fischer; Jensen, Jørgen Arendt

    2014-10-01

    This paper demonstrates that synthetic aperture imaging (SAI) can be used to achieve real-time 3-D ultrasound phased-array imaging. It investigates whether SAI increases the image quality compared with the parallel beamforming (PB) technique for real-time 3-D imaging. Data are obtained using both simulations and measurements with an ultrasound research scanner and a commercially available 3.5-MHz 1024-element 2-D transducer array. To limit the probe cable thickness, 256 active elements are used in transmit and receive for both techniques. The two imaging techniques were designed for cardiac imaging, which requires sequences designed for imaging down to 15 cm of depth and a frame rate of at least 20 Hz. The imaging quality of the two techniques is investigated through simulations as a function of depth and angle. SAI improved the full-width at half-maximum (FWHM) at low steering angles by 35%, and the 20-dB cystic resolution by up to 62%. The FWHM of the measured line spread function (LSF) at 80 mm depth showed a difference of 20% in favor of SAI. SAI reduced the cyst radius at 60 mm depth by 39% in measurements. SAI improved the contrast-to-noise ratio measured on anechoic cysts embedded in a tissue-mimicking material by 29% at 70 mm depth. The estimated penetration depth on the same tissue-mimicking phantom shows that SAI increased the penetration by 24% compared with PB. Neither SAI nor PB achieved the design goal of 15 cm penetration depth. This is likely due to the limited transducer surface area and the low SNR of the experimental scanner used. PMID:25265174

  18. Acoustic insertion loss due to two dimensional periodic arrays of circular cylinders parallel to a nearby surface

    E-print Network

    Anton Krynkin; Olga Umnova; Juan Vicente Sanchez-Perez; Alvin Y. B. Chong; Shahram Taherzadeh; Keith Attenborough

    2012-07-03

    The acoustical performance of regular arrays of cylindrical elements, with their axes aligned and parallel to a ground plane, has been investigated through predictions and laboratory experiments. Semi-analytical predictions based on multiple scattering theory and numerical simulations based on a boundary element formulation have been made. Measurements have been made in an anechoic chamber using arrays of (a) cylindrical acoustically-rigid scatterers (PVC pipes) and (b) thin elastic shells. Insertion loss (IL) spectra due to the arrays have been measured without and with ground planes for several receiver heights. Data and predictions have been compared. The minima in the excess attenuation spectrum, i.e., attenuation maxima due to the ground alone resulting from destructive interference between direct and ground-reflected sound waves, tend to have an adverse influence on the band gaps (BG) related to a periodic array in the free field when these two effects coincide. On the other hand, the presence of rigid ground may result in an IL for an array near the ground similar to, or, in the case of the first BG, greater than, that resulting from a double array, equivalent to the original array plus its ground plane mirror image, in the free field.

  19. Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment

    NASA Astrophysics Data System (ADS)

    Cao, Z.; Walsh, J. L.; Kong, M. G.

    2009-01-01

    This letter reports on the electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates, including surgical tissue forceps and a plastic plate sloped at up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than that of a comparable single jet when both are used to treat a 15° sloped substrate. These benefits likely stem from an effective self-adjustment mechanism among individual jets, facilitated by individualized ballast and spatial redistribution of surface charges.

  20. Parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers

    SciTech Connect

    Tucker, John R.; Baque, Johnathon L.; Lim, Yah Leng; Zvyagin, Andrei V.; Rakic, Aleksandar D

    2007-09-01

    In this paper we investigate the feasibility of a massively parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers (VCSELs) to measure surface profiles of displacement, distance, velocity, and liquid flow rate. The concept of the system is demonstrated using a prototype to measure the velocity at different radial points on a rotating disk, and the velocity profile of diluted milk in a custom-built diverging-converging planar flow channel. It is envisaged that a scaled-up version of the parallel self-mixing imaging system will enable real-time surface profiling, vibrometry, and flowmetry.

  1. Architecture of the parallel recirculating pipeline

    NASA Astrophysics Data System (ADS)

    Wehner, William W., II; Brandt, James

    1990-11-01

    Current image analysis and image understanding applications in DoD systems require very high performance image pixel processing in real time. Attaining the necessary performance within stringent system size, weight, and power constraints requires special-purpose parallel processing hardware architectures. At the same time, it is desirable to retain as much programmability as possible in order to rapidly adapt the hardware to new applications or evolving system requirements. The Parallel Recirculating Pipeline processor uses techniques adopted from image algebra and mathematical morphology to provide a low-cost, low-complexity, high-performance architecture that is suitable for silicon implementation and programmable in high-order languages. The parallel recirculating pipeline hardware architecture is based on a cellular array structure in which each cell is a pipelined neighborhood processor. Each processor cell transforms an entire image segment by successively executing an operation on small fixed-size neighborhoods around each pixel. By cascading a series of these operations, transforms on larger neighborhoods can be achieved. The parallel recirculating pipeline achieves cascading by allowing a series of cells to be connected in a pipelined fashion. Partial results can recirculate several times through the hardware pipeline via an external buffer memory. A virtual pipeline of any length is thus achieved. Several novel features of the architecture allow multiple pipelines to operate in parallel on strips of the same image. These features can support parallel expansion to a large number of processors with correspondingly higher throughput.
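
    The recirculation idea is easy to state in code: one fixed small-neighborhood stage, re-applied through a buffer, yields transforms over arbitrarily large neighborhoods. A generic NumPy sketch of a 3x3 morphological stage (not the actual cell hardware):

    ```python
    import numpy as np

    def erode3x3(img):
        """One pipeline cell: a 3x3 neighborhood minimum (morphological erosion)."""
        p = np.pad(img, 1, mode="edge")
        h, w = img.shape
        return np.min([p[i:i + h, j:j + w] for i in range(3) for j in range(3)],
                      axis=0)

    def recirculate(img, stage, n_passes):
        """n passes of a 3x3 stage through the buffer act on an effective
        (2n+1) x (2n+1) neighborhood -- a virtual pipeline of any length."""
        for _ in range(n_passes):
            img = stage(img)
        return img
    ```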

  2. PDDP: A data parallel programming model. Revision 1

    SciTech Connect

    Warren, K.H.

    1995-06-01

    PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed through Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared-memory style and generates code that is portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.
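
    The BLOCK distribution such directives describe reduces to simple index arithmetic; a sketch of the owner-computes map (shown in Python purely to spell out the arithmetic, not PDDP's Fortran):

    ```python
    def block_owner(i, n, nprocs):
        """Processor owning global index i of a BLOCK-distributed array of length n."""
        block = -(-n // nprocs)        # ceil(n / nprocs)
        return i // block

    def local_range(rank, n, nprocs):
        """Contiguous global index range owned by `rank`."""
        block = -(-n // nprocs)
        return range(rank * block, min(n, (rank + 1) * block))

    # 100 elements over 4 processors: blocks of 25; index 70 lives on processor 2.
    assert block_owner(70, 100, 4) == 2
    ```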

  3. The Milstar Advanced Processor

    NASA Astrophysics Data System (ADS)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of the upgrade on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  4. Integrated RF/shim coil array for parallel reception and localized B0 shimming in the human brain.

    PubMed

    Truong, Trong-Kha; Darnell, Dean; Song, Allen W

    2014-12-01

    The purpose of this work was to develop a novel integrated radiofrequency and shim (RF/shim) coil array that can perform parallel reception and localized B0 shimming in the human brain with the same coils, thereby maximizing both the signal-to-noise ratio and shimming efficiency. A 32-channel receive-only head coil array was modified to enable both RF currents (for signal reception) and direct currents (for B0 shimming) to flow in individual coil elements. Its in vivo performance was assessed in the frontal brain region, which is affected by large susceptibility-induced B0 inhomogeneities. The coil modifications did not reduce their quality factor or signal-to-noise ratio. Axial B0 maps and echo-planar images acquired in vivo with direct currents optimized to shim specific slices showed substantially reduced B0 inhomogeneities and image distortions in the frontal brain region. The B0 root-mean-square error in the anterior half of the brain was reduced by 60.3% as compared to that obtained with second-order spherical harmonic shimming. These results demonstrate that the integrated RF/shim coil array can perform parallel reception and localized B0 shimming in the human brain and provide a much more effective shimming than conventional spherical harmonic shimming alone, without taking up additional space in the magnet bore and without compromising the signal-to-noise ratio or shimming performance. PMID:25270602

  5. Integrated RF/shim coil array for parallel reception and localized B0 shimming in the human brain

    PubMed Central

    Truong, Trong-Kha; Darnell, Dean; Song, Allen W.

    2014-01-01

    The purpose of this work was to develop a novel integrated radiofrequency and shim (RF/shim) coil array that can perform parallel reception and localized B0 shimming in the human brain with the same coils, thereby maximizing both the signal-to-noise ratio and shimming efficiency. A 32-channel receive-only head coil array was modified to enable both RF currents (for signal reception) and direct currents (for B0 shimming) to flow in individual coil elements. Its in vivo performance was assessed in the frontal brain region, which is affected by large susceptibility-induced B0 inhomogeneities. The coil modifications did not reduce their quality factor or signal-to-noise ratio. Axial B0 maps and echo-planar images acquired in vivo with direct currents optimized to shim specific slices showed substantially reduced B0 inhomogeneities and image distortions in the frontal brain region. The B0 root-mean-square error in the anterior half of the brain was reduced by 60.3% as compared to that obtained with second-order spherical harmonic shimming. These results demonstrate that the integrated RF/shim coil array can perform parallel reception and localized B0 shimming in the human brain and provide a much more effective shimming than conventional spherical harmonic shimming alone, without taking up additional space in the magnet bore and without compromising the signal-to-noise ratio or shimming performance. PMID:25270602

  6. Parallel nanomanufacturing via electrohydrodynamic jetting from microfabricated externally-fed emitter arrays.

    PubMed

    Ponce de Leon, Philip J; Hill, Frances A; Heubel, Eric V; Velásquez-García, Luis F

    2015-06-01

    We report the design, fabrication, and characterization of planar arrays of externally-fed silicon electrospinning emitters for high-throughput generation of polymer nanofibers. Arrays with as many as 225 emitters and with emitter density as large as 100 emitters cm(-2) were characterized using a solution of dissolved PEO in water and ethanol. Devices with emitter density as high as 25 emitters cm(-2) deposit uniform imprints comprising fibers with diameters on the order of a few hundred nanometers. Mass flux rates as high as 417 g hr(-1) m(-2) were measured, i.e., four times the reported production rate of the leading commercial free-surface electrospinning sources. Throughput increases with increasing array size at constant emitter density, suggesting the design can be scaled up with no loss of productivity. Devices with emitter density equal to 100 emitters cm(-2) fail to generate fibers but uniformly generate electrosprayed droplets. For the arrays tested, the largest measured mass flux resulted from arrays with larger emitter separation operating at larger bias voltages, indicating the strong influence of electrical field enhancement on the performance of the devices. Incorporation of a ground electrode surrounding the array tips helps equalize the emitter field enhancement across the array as well as control the spread of the imprints over larger distances. PMID:25961886

  7. Parallel nanomanufacturing via electrohydrodynamic jetting from microfabricated externally-fed emitter arrays

    NASA Astrophysics Data System (ADS)

    Ponce de Leon, Philip J.; Hill, Frances A.; Heubel, Eric V.; Velásquez-García, Luis F.

    2015-06-01

    We report the design, fabrication, and characterization of planar arrays of externally-fed silicon electrospinning emitters for high-throughput generation of polymer nanofibers. Arrays with as many as 225 emitters and with emitter density as large as 100 emitters cm-2 were characterized using a solution of dissolved PEO in water and ethanol. Devices with emitter density as high as 25 emitters cm-2 deposit uniform imprints comprising fibers with diameters on the order of a few hundred nanometers. Mass flux rates as high as 417 g hr-1 m-2 were measured, i.e., four times the reported production rate of the leading commercial free-surface electrospinning sources. Throughput increases with increasing array size at constant emitter density, suggesting the design can be scaled up with no loss of productivity. Devices with emitter density equal to 100 emitters cm-2 fail to generate fibers but uniformly generate electrosprayed droplets. For the arrays tested, the largest measured mass flux resulted from arrays with larger emitter separation operating at larger bias voltages, indicating the strong influence of electrical field enhancement on the performance of the devices. Incorporation of a ground electrode surrounding the array tips helps equalize the emitter field enhancement across the array as well as control the spread of the imprints over larger distances.

  8. Hardware multiplier processor

    DOEpatents

    Pierce, Paul E. (Albuquerque, NM)

    1986-01-01

    A hardware processor is disclosed which, in the described embodiment, is a memory-mapped multiplier processor that can operate in parallel with a 16-bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions, so that in one access it can write a number and automatically perform single or double precision multiplication involving that number, with or without addition or subtraction with a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16-bit multiplier registers, two concatenated 16-bit multipliers, and four 16-bit product registers connected to an internal 16-bit data bus. A high level address decoder determines when the multiplier processor is being addressed, and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
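
    The key trick is that the decoded address, not the data, selects the operation, so a single bus access both delivers an operand and triggers arithmetic. A toy software model (the register layout and opcode bits are invented for illustration, not the patent's):

    ```python
    class MultiplierProcessor:
        """Memory-mapped multiplier model: low-order address bits act as opcode."""
        STORE, MUL, MAC = 0x0, 0x1, 0x2

        def __init__(self):
            self.operand = 0   # last stored multiplicand
            self.product = 0   # product/accumulator register

        def write(self, addr, value):
            op = addr & 0xF
            if op == self.STORE:
                self.operand = value                  # plain operand store
            elif op == self.MUL:
                self.product = self.operand * value   # write + multiply
            elif op == self.MAC:
                self.product += self.operand * value  # write + multiply-accumulate

        def read(self, addr):
            # a real device can also round and scale on a single read access
            return self.product

    mp = MultiplierProcessor()
    mp.write(0x0, 3); mp.write(0x1, 5); mp.write(0x2, 2)
    assert mp.read(0x0) == 21   # 3*5 + 3*2, each step a single bus access
    ```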

  9. Massively parallel visualization: Parallel rendering

    SciTech Connect

    Hansen, C.D.; Krogh, M.; White, W.

    1995-12-01

    This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygon, sphere, and volumetric data. The polygon algorithm uses a data parallel approach, whereas the sphere and volume renderers use a MIMD approach. Implementations of these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

  10. 3D optical interconnect mesh network for on-board parallel multiprocessor system based on EOPCB

    NASA Astrophysics Data System (ADS)

    Luo, Fengguang; Cao, Mingcui; Zhou, Xinjun; Xu, Jun; Luo, Zhixiang; Yuan, Jing; Zong, Liangjia; Feng, Yonghua; Chen, Chao; Zhang, Conghui

    2007-11-01

    A three-dimensional (3-D) 4×4×4 optical interconnect Mesh network scheme for a parallel multiprocessor system based on polymer light waveguide electro-optical printed circuit board (EOPCB) technology is proposed in this paper. Mesh topological structures of light waveguide interconnects are constructed for chip-to-chip interconnection of processor elements on a board and for board-to-board interconnection on a backplane. The system consists of 64 processor element chips interconnected in a 3-D Mesh network configuration. Every processor board comprises 4x4 processor element chips with Mesh interconnection. Board-to-board Mesh interconnects are established on a backplane through a light waveguide Mesh interconnect topological structure. An additional optical layer with a light waveguide structure is added to a conventional PCB to construct the EOPCB. A vertical cavity surface emitting laser (VCSEL) array is used as the optical transmitter array, and a PIN photodiode array is used as the optical receiver array. An MT-compatible direct coupling method is presented to couple light beams between the optical transmitters/receivers and the light waveguide layer. The optical signals from a processor element chip on one board can be transmitted to a processor element chip on another board through light waveguide interconnection in the backplane. Thus a 3-D optical interconnect Mesh network for a parallel multiprocessor system can be realized with EOPCB.
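
    The 4×4×4 Mesh topology fixes which chips each node's waveguides must reach; a small sketch of the neighbor computation (coordinate convention assumed):

    ```python
    def mesh_neighbors(node, dims=(4, 4, 4)):
        """Neighbors of `node` in a 3-D mesh: one hop along each axis,
        no wraparound (a torus would take coordinates modulo dims)."""
        nbrs = []
        for axis in range(3):
            for delta in (-1, 1):
                p = list(node)
                p[axis] += delta
                if 0 <= p[axis] < dims[axis]:
                    nbrs.append(tuple(p))
        return nbrs

    # Corner chips get 3 waveguide links, interior chips the full 6.
    assert len(mesh_neighbors((0, 0, 0))) == 3
    assert len(mesh_neighbors((1, 2, 1))) == 6
    ```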

  11. Simulation of three-dimensional laminar flow and heat transfer in an array of parallel microchannels 

    E-print Network

    Mlcak, Justin Dale

    2009-05-15

    Heat transfer and fluid flow are studied numerically for a repeating microchannel array with water as the circulating fluid. Generalized transport equations are discretized and solved in three dimensions for velocities, pressure, and temperature...

  12. Graded index linear and curved polymer channel waveguide arrays for massively parallel optical interconnects

    NASA Astrophysics Data System (ADS)

    Chen, Ray T.

    1992-11-01

    A single-mode polymer-based graded index channel waveguide array with 1250 channels/cm packaging density on a cross-link induced photopolymeric thin film is reported. This array works at 1.31 and 0.63 microns. Curved waveguides with radii of curvature from 1 to 40 mm were demonstrated. Waveguide propagation loss in the neighborhood of 0.1 db/cm was experimentally confirmed at 1.31 microns.

  13. Tiled Multicore Processors

    NASA Astrophysics Data System (ADS)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  14. A novel polymeric microelectrode array for highly parallel, long-term neuronal culture and stimulation

    E-print Network

    Talei Franzesi, Giovanni

    2008-01-01

    Cell-based high-throughput screening is emerging as a disruptive technology in drug discovery; however, massively parallel electrical assaying of neurons and cardiomyocytes has until now been prohibitively expensive. To ...

  15. Comparative Analysis on the Performance of a Short String of Series-Connected and Parallel-Connected Photovoltaic Array Under Partial Shading

    NASA Astrophysics Data System (ADS)

    Vijayalekshmy, S.; Rama Iyer, S.; Beevi, Bisharathu

    2015-09-01

    The output power from a photovoltaic (PV) array decreases, and the array exhibits multiple peaks, when it is subjected to partial shading (PS). The power loss in the PV array varies with the array configuration, physical location and the shading pattern. This paper compares the relative performance of a PV array consisting of a short string of three PV modules for two different configurations. The mismatch loss, shading loss, fill factor and the power loss due to failure in tracking the global maximum power point, of a series string with bypass diodes and of a short parallel string, are analysed using a MATLAB/Simulink model. The performance of the system is investigated for three different conditions of solar insolation with the same shading pattern. Results indicate that there is considerably more power loss due to shading in a series string during PS than in a parallel string with the same number of modules.
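
    A toy model makes the comparison tangible. The sketch below idealizes each module as a current source proportional to insolation at a fixed operating voltage; all numbers are illustrative assumptions, not the paper's MATLAB/Simulink model.

        # Idealized 3-module string under partial shading.  Each module is a
        # current source I = s * I_STC at a fixed voltage V_MOD; values are
        # illustrative assumptions, not data from the paper.
        I_STC = 8.0                  # module current at full sun (A), assumed
        V_MOD = 18.0                 # module operating voltage (V), assumed
        shading = [1.0, 0.6, 0.3]    # per-module insolation fractions

        def series_power(shading):
            # With bypass diodes, modules that cannot carry the string
            # current are bypassed; scanning the candidate currents visits
            # the multiple local peaks and returns the global one.
            best = 0.0
            for s in shading:
                current = s * I_STC
                active = sum(1 for t in shading if t * I_STC >= current)
                best = max(best, current * active * V_MOD)
            return best

        def parallel_power(shading):
            # Parallel modules share the voltage and their currents add,
            # so this idealization shows no mismatch loss at all.
            return sum(s * I_STC * V_MOD for s in shading)

        print("series  :", series_power(shading), "W")    # 172.8 W
        print("parallel:", parallel_power(shading), "W")  # 273.6 W

    Consistent with the abstract's finding, the shaded series string delivers markedly less power than the parallel string in this idealization.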

  16. Template-directed atomically precise self-organization of perfectly ordered parallel cerium silicide nanowire arrays on Si(110)-16×2 surfaces

    PubMed Central

    2013-01-01

    The perfectly ordered parallel arrays of periodic Ce silicide nanowires can self-organize with atomic precision on single-domain Si(110)-16×2 surfaces. The growth evolution of self-ordered parallel Ce silicide nanowire arrays is investigated over a broad range of Ce coverages on single-domain Si(110)-16×2 surfaces by scanning tunneling microscopy (STM). Three different types of well-ordered parallel arrays, consisting of uniformly spaced and atomically identical Ce silicide nanowires, are self-organized through the heteroepitaxial growth of Ce silicides on a long-range grating-like 16×2 reconstruction at various Ce coverages. Each atomically precise Ce silicide nanowire consists of a bundle of chains and rows with different atomic structures. The atomic-resolution dual-polarity STM images reveal that the interchain coupling leads to the formation of registry-aligned chain bundles within individual Ce silicide nanowires. The nanowire width and the interchain coupling can be adjusted systematically by varying the Ce coverage on the Si(110) surface. This natural template-directed self-organization of perfectly regular parallel nanowire arrays allows for the precise control of feature size and position within ±0.2 nm over a large area. Thus, it is a promising route to produce parallel nanowire arrays in a straightforward, low-cost, high-throughput process. PMID:24188092

  17. Photorefractive processing for large adaptive phased arrays

    NASA Astrophysics Data System (ADS)

    Weverka, Robert T.; Wagner, Kelvin; Sarto, Anthony

    1996-03-01

    An adaptive null-steering phased-array optical processor that utilizes a photorefractive crystal to time integrate the adaptive weights and null out correlated jammers is described. This is a beam-steering processor in which the temporal waveform of the desired signal is known but the look direction is not. The processor computes the angle(s) of arrival of the desired signal and steers the array to look in that direction while rotating the nulls of the antenna pattern toward any narrow-band jammers that may be present. We have experimentally demonstrated a simplified version of this adaptive phased-array-radar processor that nulls out the narrow-band jammers by using feedback-correlation detection. In this processor it is assumed that we know a priori only that the signal is broadband and the jammers are narrow band. These are examples of a class of optical processors that use the angular selectivity of volume holograms to form the nulls and look directions in an adaptive phased-array-radar pattern and thereby harness the computational abilities of three-dimensional parallelism in the volume of photorefractive crystals. The development of this processing approach in a volume holographic system has led to a new algorithm for phased-array-radar processing that uses fewer tapped-delay lines than does the classic time-domain beam former. The optical implementation of the new algorithm has the further advantage of utilizing a single photorefractive crystal to implement as many as a million adaptive weights, allowing the radar system to scale to large size with no increase in processing hardware.
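
    For readers unfamiliar with the underlying signal processing, the sketch below runs the adaptive-nulling idea digitally: an LMS loop time-integrates the weight updates against a known desired waveform, loosely mirroring how the photorefractive crystal time-integrates the adaptive weights. The array geometry, signal model and step size are illustrative assumptions.

        # Digital stand-in for the adaptive null-steering loop: LMS weights
        # integrated over time against a known desired waveform.  The
        # 8-element half-wavelength array, the 30-degree narrow-band jammer
        # and the step size are all illustrative assumptions.
        import numpy as np

        rng = np.random.default_rng(0)
        M, T, d = 8, 5000, 0.5                 # elements, snapshots, spacing

        def steering(theta_deg):
            k = 2 * np.pi * d * np.sin(np.radians(theta_deg))
            return np.exp(1j * k * np.arange(M))

        a_sig, a_jam = steering(0.0), steering(30.0)
        desired = rng.standard_normal(T)       # known broadband waveform
        jammer = np.exp(2j * np.pi * 0.1 * np.arange(T))  # narrow-band tone
        noise = 0.1 * (rng.standard_normal((T, M)) +
                       1j * rng.standard_normal((T, M)))
        X = np.outer(desired, a_sig) + 5.0 * np.outer(jammer, a_jam) + noise

        w, mu = np.zeros(M, dtype=complex), 1e-4
        for t in range(T):                     # time-integrated weight update
            e = desired[t] - np.vdot(w, X[t])  # error vs. known waveform
            w += mu * np.conj(e) * X[t]

        # the jammer-direction gain ends up far below the look-direction gain
        print("gain toward look direction:", abs(np.vdot(w, a_sig)))
        print("gain toward jammer        :", abs(np.vdot(w, a_jam)))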

  18. Dynamically reconfigurable optical morphological processor and its applications

    NASA Technical Reports Server (NTRS)

    Chao, Tien-Hsin

    1993-01-01

    An innovative optically implemented morphological processor is introduced. With the use of a large space-bandwidth-product Dammann grating and a high-speed shutter spatial light modulator, an effective structuring element with large size and arbitrary shape can be constructed with dynamic reconfigurability. This reconfigurability is a major improvement over the conventional correlator-based morphological processor in which fixed holographic filters are used as structuring elements (Casasent and Botha, 1988). A novel two-dimensional thresholding photodetector array, capable of performing parallel thresholding and feedback, is utilized in this system and makes possible the implementation of many complex morphological operations requiring iterative feedback and full programmability. The optical architecture and the principle of operation are presented. Binary image morphological erosion, dilation, opening, and closing are demonstrated experimentally. An approach for extending the system to gray-scale images using threshold decomposition is also discussed.
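
    The four operations named above reduce to shifts combined with AND/OR on binary images; the numpy sketch below shows them for a 3×3 square structuring element (an illustrative choice; the paper's point is that the optical structuring element can be large, arbitrarily shaped, and dynamically reconfigured).

        # Binary morphology via shift-and-combine; np.roll wraps at the
        # borders, which is acceptable for this toy example.
        import numpy as np

        def dilate(img, se):
            """Dilation: OR of the image shifted by each 'on' SE offset."""
            cy, cx = se.shape[0] // 2, se.shape[1] // 2
            out = np.zeros_like(img)
            for dy in range(se.shape[0]):
                for dx in range(se.shape[1]):
                    if se[dy, dx]:
                        out |= np.roll(np.roll(img, dy - cy, 0), dx - cx, 1)
            return out

        def erode(img, se):
            """Erosion: AND of the image shifted by each 'on' SE offset."""
            cy, cx = se.shape[0] // 2, se.shape[1] // 2
            out = np.ones_like(img)
            for dy in range(se.shape[0]):
                for dx in range(se.shape[1]):
                    if se[dy, dx]:
                        out &= np.roll(np.roll(img, cy - dy, 0), cx - dx, 1)
            return out

        def opening(img, se):
            return dilate(erode(img, se), se)

        def closing(img, se):
            return erode(dilate(img, se), se)

        img = np.zeros((8, 8), dtype=np.uint8)
        img[2:6, 2:6] = 1
        se = np.ones((3, 3), dtype=np.uint8)
        print(erode(img, se))   # the 4x4 block shrinks to 2x2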

  19. Advanced parallel processing with supercomputer architectures

    SciTech Connect

    Hwang, K.

    1987-10-01

    This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers.

  20. Optimal expression evaluation for data parallel architectures

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    A data parallel machine represents an array or other composite data structure by allocating one processor (at least conceptually) per data item. A pointwise operation can be performed between two such arrays in unit time, provided their corresponding elements are allocated in the same processors. If the arrays are not aligned in this fashion, the cost of moving one or both of them is part of the cost of the operation. The choice of where to perform the operation then affects this cost. If an expression with several operands is to be evaluated, there may be many choices of where to perform the intermediate operations. An efficient algorithm is given to find the minimum-cost way to evaluate an expression, for several different data parallel architectures. This algorithm applies to any architecture in which the metric describing the cost of moving an array is robust. This encompasses most of the common data parallel communication architectures, including meshes of arbitrary dimension and hypercubes. Remarks are made on several variations of the problem, some of which are solved and some of which remain open.
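
    A hedged sketch of the optimization described above: for each subexpression, tabulate the cheapest cost of producing its value at each candidate location, where move(a, b) is the (robust) metric for realigning an array. The location set and metric below are illustrative stand-ins for the paper's architecture-specific formulation.

        # Bottom-up cost table for evaluating an expression tree on a data
        # parallel machine.  A node is ('leaf', home_location) or
        # ('op', left, right); move(a, b) is the realignment cost metric.
        def min_cost(node, locations, move):
            """Return {loc: cheapest cost of producing node's value at loc}."""
            if node[0] == 'leaf':
                return {loc: move(node[1], loc) for loc in locations}
            _, left, right = node
            lc = min_cost(left, locations, move)
            rc = min_cost(right, locations, move)
            meet = {loc: lc[loc] + rc[loc] for loc in locations}  # operate here
            return {loc: min(meet[m] + move(m, loc) for m in locations)
                    for loc in locations}

        # 1-D mesh of 4 processors with a hop-count metric.
        locs = list(range(4))
        move = lambda a, b: abs(a - b)
        expr = ('op', ('leaf', 0), ('op', ('leaf', 3), ('leaf', 3)))
        print(min(min_cost(expr, locs, move).values()))   # -> 3 hops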

  1. Parallel high-throughput microanalysis of materials using microfabricated full bridge device arrays

    NASA Astrophysics Data System (ADS)

    Potyrailo, Radislav A.; Morris, William G.

    2004-01-01

    An array of microfabricated full bridge devices has been implemented for the rapid thermal microanalysis of polymers. In each microelectromechanical system device, four strain gauges were formed in silicon cantilevered microbeams and were configured as a Wheatstone bridge circuit. Glass transition temperatures Tg were measured by quantitation of the strain produced in the sensor by the stress applied by a polymer layer to the cantilevered microbeams. The measured strain was analyzed as a function of chip temperature for the change in slope, which was indicative of Tg. The resolution of Tg determinations for amorphous and crystalline polymers was <0.25 °C and <2.0 °C, respectively, which makes the approach attractive for combinatorial screening of polymers. Our approach is a practical alternative to known methods for Tg determination because of its immunity to variations in the amount of deposited material and its viscosity and the vapor pressure of the employed solvent, and its ease of multiplexing into dense sensor arrays.

  2. Development of parallel architectures for sensor array-processing algorithms. Semi-Annual report

    SciTech Connect

    Jamali, M.M.; Kwatra, S.C.; Djoudi, A.; Sheelvant, R.; Rao, M.

    1991-08-01

    High resolution direction of arrival (DOA) estimation has been an important area of research for a number of years, and many researchers have developed a variety of algorithms to estimate the direction of arrival. Another important aspect of the DOA estimation area is the development of high speed hardware capable of computing the DOA in real time. In this research the authors have first focused on the development of parallel architectures for the multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT) algorithms for narrow band sources. Computationally efficient modules are substituted into these algorithms, which are then converted to pipelined and parallel form. For example, one important computation, the eigendecomposition of the covariance matrix, is performed using Householder transformations and the QR method.
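
    As a point of reference for the computations being accelerated, the following numpy sketch runs the narrow-band MUSIC steps end to end. Numpy's eigh stands in here for the Householder/QR eigendecomposition stage; the geometry, source angles and noise level are illustrative.

        # Narrow-band MUSIC on a synthetic 8-element array: form the sample
        # covariance, split off the noise subspace, scan the pseudospectrum.
        import numpy as np

        rng = np.random.default_rng(1)
        M, T, d = 8, 200, 0.5                    # elements, snapshots, spacing
        true_doas = [-20.0, 25.0]                # illustrative source angles

        def steering(theta_deg):
            k = 2 * np.pi * d * np.sin(np.radians(theta_deg))
            return np.exp(1j * k * np.arange(M))

        A = np.stack([steering(a) for a in true_doas], axis=1)
        S = rng.standard_normal((2, T)) + 1j * rng.standard_normal((2, T))
        X = A @ S + 0.1 * (rng.standard_normal((M, T)) +
                           1j * rng.standard_normal((M, T)))

        R = X @ X.conj().T / T                   # sample covariance
        _, V = np.linalg.eigh(R)                 # eigenvalues ascending
        En = V[:, :M - len(true_doas)]           # noise subspace

        scan = np.arange(-90.0, 90.0, 0.25)
        p = np.array([1.0 / np.linalg.norm(En.conj().T @ steering(a)) ** 2
                      for a in scan])
        peaks = [i for i in range(1, len(p) - 1)
                 if p[i] > p[i - 1] and p[i] > p[i + 1]]
        peaks.sort(key=lambda i: p[i], reverse=True)
        print(sorted(scan[i] for i in peaks[:2]))   # close to the true DOAs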

  3. Database Reorganization in Parallel Disk Arrays with I/O Service Stealing

    NASA Technical Reports Server (NTRS)

    Zabback, Peter; Onyuksel, Ibrahim; Scheuermann, Peter; Weikum, Gerhard

    1996-01-01

    We present a model for data reorganization in parallel disk systems that is geared towards load balancing in an environment with periodic access patterns. Data reorganization is performed by disk cooling, i.e. migrating files or extents from the hottest disks to the coldest ones. We develop an approximate queueing model for determining the effective arrival rates of cooling requests and discuss its use in assessing the costs versus benefits of cooling.
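
    The cooling policy itself is simple to state; the sketch below implements a greedy version. The heat values and the pure-benefit test are illustrative: the paper's contribution is the queueing model that weighs migration cost against this benefit.

        # Greedy disk cooling: move the hottest extent whose migration
        # narrows the gap between the hottest and the coldest disk.
        def cool(disks, max_moves=100):
            """disks: list of dicts mapping extent-id -> heat (access rate)."""
            for _ in range(max_moves):
                load = [sum(d.values()) for d in disks]
                hot = max(range(len(disks)), key=load.__getitem__)
                cold = min(range(len(disks)), key=load.__getitem__)
                gap = load[hot] - load[cold]
                # moving heat h shrinks the gap to |gap - 2h|, so only
                # extents with h < gap are worth migrating
                movable = {e: h for e, h in disks[hot].items() if h < gap}
                if not movable:
                    break
                ext = max(movable, key=movable.get)   # hottest movable extent
                disks[cold][ext] = disks[hot].pop(ext)
            return disks

        disks = [{'a': 10, 'b': 8, 'c': 1}, {'d': 2}, {'e': 1}]
        print([sum(d.values()) for d in cool(disks)])   # loads even out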

  4. Analog Processor To Solve Optimization Problems

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

    1993-01-01

    Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

  5. Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems

    PubMed Central

    Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

    2011-01-01

    A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, using larger numbers of gyro incremental angles and accelerometer incremental velocities, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high-precision requirements of the system in a high-dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058

  6. Field programmable gate array based parallel strapdown algorithm design for strapdown inertial navigation systems.

    PubMed

    Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

    2011-01-01

    A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, using larger numbers of gyro incremental angles and accelerometer incremental velocities, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high-precision requirements of the system in a high-dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058

  7. Communication efficient parallel algorithms for nonnumerical computations

    SciTech Connect

    Doshi, K.A.

    1988-01-01

    The broad goal of this research is to develop a set of paradigms for mapping data-dependent symbolic computations on realistic models of parallel architectures. Within this goal, the thesis represents the initial effort to achieve efficient parallel solutions for a number of non-numerical problems on networks of processors. The specific contributions of the thesis are new parallel algorithms, exhibiting linear speedup on architectures consisting of fixed numbers of processors (i.e., bounded models). The following problems have been considered in the thesis: (1) Determine the minimum spanning tree (MST), and identify the bridges and articulation points (APs) of an undirected weighted graph represented by an n x n adjacency matrix. (2) The pattern matching problem: Given two strings of characters, of lengths m and n (m ≤ n) respectively, mark all positions in the second string where there appears an instance of the first string. (3) Sort n elements. For each problem, the author uses a processor-network consisting of p processors. The network model used in the solution of the first set of problems is the linear array, while that used in the solutions of the second and third problems is a butterfly-connected system. The solutions on the butterfly-connected system apply also on a pipelined hypercube. The performances of the solutions are summarized.

  8. Implementation and Assessment of Advanced Analog Vector-Matrix Processor

    NASA Technical Reports Server (NTRS)

    Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    This paper discusses the design and implementation of an analog optical vector-matrix coprocessor with a throughput of 128 Mops for a personal computer. Vector-matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector-matrix processor consists of a laser diode source, an acoustooptical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector-matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low cost, highly energy efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation. The sufficiency of the measured accuracy of the processor is compared with that required for a range of typical problems. Calculations resolving alloy concentrations from spectral plume data of rocket engines are implemented on the optical processor, demonstrating its sufficiency for this problem. We also show how this technology can be easily extended to a 100 x 100, 10 MHz (200 Gops) processor.

  9. Scripts for Scalable Monitoring of Parallel Filesystem Infrastructure

    SciTech Connect

    2014-02-27

    Scripts for scalable monitoring of parallel filesystem infrastructure provide frameworks for monitoring the health of block storage arrays and large InfiniBand fabrics. The block storage framework uses Python multiprocessing so that the number of arrays monitored concurrently scales with the number of processors in the system. This enables live monitoring of HPC-scale filesystems with 10-50 storage arrays. For InfiniBand monitoring, scripts are included that monitor the InfiniBand health of each host, along with visualization tools for mapping complex fabric topologies.
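
    A minimal sketch of the block-storage side, assuming a hypothetical check_array probe in place of the real vendor queries: a multiprocessing pool sizes the number of concurrent polls to the processor count, which is the scaling idea described above.

        # One health poll per storage array, fanned out across all CPUs.
        # check_array and the array names are hypothetical placeholders; a
        # real script would shell out to vendor CLI or SNMP queries.
        import multiprocessing as mp

        ARRAYS = [f"array{i:02d}" for i in range(1, 41)]   # e.g. 40 arrays

        def check_array(name):
            """Poll one array and return (name, status)."""
            return name, "OK"          # placeholder for real probe logic

        if __name__ == "__main__":
            with mp.Pool(processes=mp.cpu_count()) as pool:
                for name, status in pool.imap_unordered(check_array, ARRAYS):
                    if status != "OK":
                        print(f"ALERT {name}: {status}")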

  10. Parallel grid population

    DOEpatents

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
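
    The two phases of the claim can be paraphrased in a few lines of code; the sketch below serializes the conceptually parallel loops and uses axis-aligned boxes as the objects purely for illustration.

        # Phase 1 bins objects by the grid portions that bound them;
        # phase 2 hands each processor its own portion's bin to populate.
        def populate(portions, object_sets):
            """portions: one ((x0, y0), (x1, y1)) rectangle per processor;
            object_sets: one distinct list of box objects per processor."""
            def overlaps(obj, por):
                (ax0, ay0), (ax1, ay1) = obj
                (bx0, by0), (bx1, by1) = por
                return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

            bins = [[] for _ in portions]
            for objs in object_sets:            # conceptually parallel
                for obj in objs:
                    for p, por in enumerate(portions):
                        if overlaps(obj, por):
                            bins[p].append(obj)
            return bins                          # portion -> its objects

        portions = [((0, 0), (5, 10)), ((5, 0), (10, 10))]
        sets = [[((1, 1), (2, 2))], [((4, 4), (6, 6))]]  # 2nd straddles both
        print(populate(portions, sets))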

  11. Parallel multi-step nanolithography by nanoscale Cu-covered h-PDMS tip array

    NASA Astrophysics Data System (ADS)

    Chang, Yuan-Jen; Huang, Han-Kuan

    2014-09-01

    Tip-based nanolithography provides a flexible nanolithographic technology, and tip fabrication is one of its main challenges. In this paper, we propose combining dry etching of photoresist with electro-chemical machining to reduce the size of the tip opening. We successfully fabricate a tip opening with a diameter of 200 nm. After lithography and lift-off, gold dot patterns with a diameter of 280 nm are demonstrated. Moreover, a home-made multi-step exposure system is built, and successful 14- and 44-step nanolithography with a tip array is demonstrated.

  12. Orthogonal and parallel lattice plasmon resonance in core-shell SiO(2)/Au nanocylinder arrays.

    PubMed

    Lin, Linhan; Yi, Yasha

    2015-01-12

    Height-induced coupling behavior between the plasmonic modes and diffraction orders was studied in core-shell SiO(2)/Au nanocylinder arrays (NCAs) using finite difference time domain (FDTD) simulations. New lattice plasmon modes (LPMs) are observed in structures with high aspect ratio. Specifically, parallel coupling between the plasmonic modes and diffraction orders is obtained here, which shows different coupling behavior from orthogonal LPMs. Electromagnetic (EM) field distributions indicate that horizontal propagation of the magnetic or electric field component is responsible for the generation of these orthogonal and parallel LPMs, respectively. Radiative loss can be effectively suppressed when the height increases, which is important for applications in fluorescence enhancement and nanolasers. Further studies confirm that the LPMs associated with the superstrate diffraction orders can be well maintained even when the Au coating is imperfect. The interference from the substrate-associated LPMs can be eliminated by cutting off the corresponding diffraction waves through the introduction of a Si(3)N(4) substrate. This study of coupling behavior in core-shell NCAs enables a novel route to design and optimize LPMs for bio-sensing and nanolaser applications. PMID:25835660

  13. Interactive animation of fault-tolerant parallel algorithms

    SciTech Connect

    Apgar, S.W.

    1992-02-01

    Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.

  14. Magnetic arrays

    DOEpatents

    Trumper, D.L.; Kim, W.; Williams, M.E.

    1997-05-20

    Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

  15. Magnetic arrays

    DOEpatents

    Trumper, David L. (Plaistow, NH); Kim, Won-jong (Cambridge, MA); Williams, Mark E. (Pelham, NH)

    1997-05-20

    Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

  16. High-resolution parallel-detection sensor array using piezo-phototronics effect

    DOEpatents

    Wang, Zhong L.; Pan, Caofeng

    2015-07-28

    A pressure sensor element includes a substrate, a first type of semiconductor material layer and an array of elongated light-emitting piezoelectric nanostructures extending upwardly from the first type of semiconductor material layer. A p-n junction is formed between each nanostructure and the first type semiconductor layer. An insulative resilient medium layer is infused around each of the elongated light-emitting piezoelectric nanostructures. A transparent planar electrode, disposed on the resilient medium layer, is electrically coupled to the top of each nanostructure. A voltage source is coupled to the first type of semiconductor material layer and the transparent planar electrode and applies a biasing voltage across each of the nanostructures. Each nanostructure emits light in an intensity that is proportional to an amount of compressive strain applied thereto.

  17. A General-Purpose CMOS Vision Chip with a Processor-Per-Pixel SIMD Array

    E-print Network

    Dudek, Piotr

    of processors called APEs (analogue processing elements). The name reflects the fact that data is represented and manipulated inside the APEs as analogue samples. The APEs execute identical instructions on their local data in an SIMD (Single Instruction ...

  18. Electro-optical processor for optimal control

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Neuman, C.; Carlotto, M.

    1981-01-01

    An iterative optical processor has been developed for applications in the optimal control of advanced sensor systems. The processor is designed for the realization of the Richardson algorithm on bipolar data, using as input a linear array of LEDs. The usefulness of the processor has been demonstrated by the solution of the linear quadratic regulator problem for the optimal control signals of the F100 turbofan engine. In this case study, the algebraic Riccati equation matrix was solved by the use of a modified Kleinman algorithm along with the Richardson algorithm applied to a system of linear algebraic equations. Preliminary experimental results demonstrate the gradual convergence of the processor.
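
    The Richardson algorithm itself is a one-line iteration; the numpy sketch below shows the update the processor realizes optically (the matrix, right-hand side and relaxation parameter are illustrative).

        # Richardson iteration x <- x + omega * (b - A x); it converges when
        # the spectral radius of (I - omega * A) is below 1.
        import numpy as np

        def richardson(A, b, omega, iters=200):
            x = np.zeros_like(b)
            for _ in range(iters):
                x = x + omega * (b - A @ x)   # one optical pass per sweep
            return x

        A = np.array([[4.0, 1.0], [1.0, 3.0]])
        b = np.array([1.0, 2.0])
        x = richardson(A, b, omega=0.2)
        print(x, A @ x)                        # A @ x approaches b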

  19. Parallel rendering techniques for massively parallel visualization

    SciTech Connect

    Hansen, C.; Krogh, M.; Painter, J.

    1995-07-01

    As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP's abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper describes recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques, presenting rendering algorithms, developed for massively parallel processors (MPPs), for polygons, spheres, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume renderers use a MIMD approach. Implementations of these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

  20. [6] H. Meijer and S. G. Akl. Optimal computation of prefix sums on a binary tree of processors. International Journal of Parallel Programming, 16:127--136, 1987.

    E-print Network

    Plaxton, Charles Gregory

    , Rice University, Department of Electrical and Computer Engineering, February 1988. Ernst W. Mayr, E. W. Mayr, and M. K. Warmuth. Parallel approximation algorithms for bin packing. Technical Report

  1. A cost-effective methodology for the design of massively-parallel VLSI functional units

    NASA Technical Reports Server (NTRS)

    Venkateswaran, N.; Sriram, G.; Desouza, J.

    1993-01-01

    In this paper we propose a generalized methodology for the design of cost-effective massively-parallel VLSI Functional Units. This methodology is based on a technique of generating and reducing a massive bit-array on the mask-programmable PAcube VLSI array. This methodology unifies (maintains identical data flow and control) the execution of complex arithmetic functions on PAcube arrays. It is highly regular, expandable and uniform with respect to problem-size and wordlength, thereby reducing the communication complexity. The memory-functional unit interface is regular and expandable. Using this technique, functional units of dedicated processors can be mask-programmed on the naked PAcube arrays, reducing the turn-around time. The production cost of such dedicated processors can be drastically reduced since the naked PAcube arrays can be mass-produced. Analysis of the performance of functional units designed by our method yields promising results.

  2. FFT Computation with Systolic Arrays, A New Architecture

    NASA Technical Reports Server (NTRS)

    Boriakoff, Valentin

    1994-01-01

    The use of the Cooley-Tukey algorithm for computing the 1-D FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.

  3. A kind of FPGA-based correlating Shack-Hartmann wave-front processor

    NASA Astrophysics Data System (ADS)

    Peng, Xiaofeng; Li, Mei; Rao, Changhui

    2008-12-01

    In solar adaptive optics systems, the absolute difference algorithm and the correlation coefficient algorithm are widely used for Shack-Hartmann wave-front detection of extended objects. A Shack-Hartmann wave-front processor based on the absolute difference algorithm is described in this paper. It is characterized by a parallel, systolic architecture. The peak operation speed is over 23 billion operations per second and the calculation latency is 120 us in a system with a 6x6 sub-aperture array, in which each sub-aperture is 32x32 pixels and the reference image is 16x16 pixels. Using this processor, the frame rate of the CCD (Charge Coupled Device) can be up to 1000 Hz, and with smaller sub-aperture sizes the frame rate can be even higher. Built in a single FPGA (Field Programmable Gate Array), it is low-cost, compact and easy to modify.
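
    The absolute difference algorithm amounts to a sum-of-absolute-differences (SAD) search over shifts; the sketch below mirrors the sizes quoted above on synthetic data (a real implementation would pipeline the sums in hardware rather than loop).

        # Slide a 16x16 reference over a 32x32 sub-aperture image and return
        # the shift minimizing the sum of absolute differences.
        import numpy as np

        def sad_shift(sub_img, ref):
            H, W = sub_img.shape
            h, w = ref.shape
            best, best_off = None, (0, 0)
            for dy in range(H - h + 1):
                for dx in range(W - w + 1):
                    sad = np.abs(sub_img[dy:dy + h, dx:dx + w] - ref).sum()
                    if best is None or sad < best:
                        best, best_off = sad, (dy, dx)
            return best_off

        rng = np.random.default_rng(2)
        ref = rng.integers(0, 256, (16, 16)).astype(np.int64)
        sub = rng.integers(0, 256, (32, 32)).astype(np.int64)
        sub[5:21, 7:23] = ref              # plant the reference at (5, 7)
        print(sad_shift(sub, ref))         # -> (5, 7)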

  4. Parallel recognition of cancer cells using an addressable array of solid-state micropores.

    PubMed

    Ilyas, Azhar; Asghar, Waseem; Kim, Young-tae; Iqbal, Samir M

    2014-12-15

    Early stage detection and precise quantification of circulating tumor cells (CTCs) in the peripheral blood of cancer patients are important for early diagnosis. Early diagnosis improves the effectiveness of therapy and results in better prognosis. Several techniques have been used for CTC detection but are limited by their need for dye tagging, low throughput and lack of statistical reliability at the single cell level. Solid-state micropores can characterize each cell in a sample, providing interesting information about cellular populations. We report a multi-channel device that utilizes a solid-state micropore array assembly for simultaneous measurement of cell translocation. This increased the throughput of the measurement, and as the cells passed through the micropores, tumor cells showed distinctive current blockade pulses compared to leukocytes. The ionic current across each micropore channel was continuously monitored and recorded. The measurement system not only increased throughput but also provided on-chip cross-correlation. Whole blood was lysed to remove red blood cells, so blood dilution was not needed. The approach facilitated faster processing of blood samples with a tumor cell detection efficiency of about 70%. The design provides a simple and inexpensive method for rapid and reliable detection of tumor cells without any cell staining or surface functionalization. The device can also be used for high throughput electrophysiological analysis of other cell types. PMID:25038540

  5. Parallel waveform extraction algorithms for the Cherenkov Telescope Array Real-Time Analysis

    E-print Network

    Zoli, Andrea; De Rosa, Adriano; Aboudan, Alessio; Fioretti, Valentina; De Cesare, Giovanni; Marx, Ramin

    2015-01-01

    The Cherenkov Telescope Array (CTA) is the next generation observatory for the study of very high-energy gamma rays from about 20 GeV up to 300 TeV. Thanks to the large effective area and field of view, the CTA observatory will be characterized by an unprecedented sensitivity to transient flaring gamma-ray phenomena compared to both current ground (e.g. MAGIC, VERITAS, H.E.S.S.) and space (e.g. Fermi) gamma-ray telescopes. In order to trigger the astrophysics community for follow-up observations, or being able to quickly respond to external science alerts, a fast analysis pipeline is crucial. This will be accomplished by means of a Real-Time Analysis (RTA) pipeline, a fast and automated science alert trigger system, becoming a key system of the CTA observatory. Among the CTA design key requirements to the RTA system, the most challenging is the generation of alerts within 30 seconds from the last acquired event, while obtaining a flux sensitivity not worse than the one of the final analysis by more than a fac...

  6. An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

    E-print Network

    Junichiro Makino

    2001-08-27

    We present a novel, highly efficient algorithm to parallelize the O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as we increase the number of processors, since the communication cost is independent of the number of processors. In the new algorithm, p processors are organized as a $\sqrt{p}\times \sqrt{p}$ two-dimensional array. Each processor has $N/\sqrt{p}$ particles, but the data are distributed in such a way that the complete system is represented if we look at any row or column consisting of $\sqrt{p}$ processors. In this algorithm, the communication cost scales as $N/\sqrt{p}$, while the calculation cost scales as $N^2/p$. Thus, we can use a much larger number of processors without losing efficiency, compared to what was practical with previously known algorithms.
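
    The claimed scaling is easy to sanity-check numerically; the sketch below tabulates the communication-to-computation ratio sqrt(p)/N implied by the costs quoted in the abstract (the unit costs are arbitrary illustrative constants).

        # Per-step costs for the 2-D decomposition: communication ~ N/sqrt(p),
        # computation ~ N^2/p, so their ratio grows only as sqrt(p)/N.
        import math

        def step_costs(N, p, t_comm=1.0, t_calc=1.0):
            comm = t_comm * N / math.sqrt(p)   # row/column particle exchange
            calc = t_calc * N * N / p          # pairwise force evaluations
            return comm, calc

        for p in (4, 16, 64, 256):
            comm, calc = step_costs(N=16384, p=p)
            print(f"p={p:4d}  comm/calc = {comm / calc:.5f}")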

  7. Parallel asynchronous systems and image processing algorithms

    NASA Technical Reports Server (NTRS)

    Coon, D. D.; Perera, A. G. U.

    1989-01-01

    A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.

  8. Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors

    NASA Astrophysics Data System (ADS)

    Aghili Yajadda, Mir Massoud

    2014-10-01

    We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of the thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV-50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model, due to the size distribution in the networks and the irregular shape of the nanoparticles. Non-Arrhenius behavior of the samples in the zero bias voltage limit was attributed to disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

  9. Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors

    SciTech Connect

    Aghili Yajadda, Mir Massoud

    2014-10-21

    We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of the thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV–50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model, due to the size distribution in the networks and the irregular shape of the nanoparticles. Non-Arrhenius behavior of the samples in the zero bias voltage limit was attributed to disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

  10. High performance selectively oxidized VCSELs and arrays for parallel high-speed optical interconnects

    NASA Astrophysics Data System (ADS)

    Mederer, Felix; Grabherr, Martin; Eberhard, Franz; Ecker, Irene; Jäger, Roland; Joos, Jürgen; Jung, Chistian; Kicherer, Max; King, Roger; Schnitzer, Peter; Unold, Heiko; Wiedenmann, Dieter; Ebeling, Karl Joachim

    We introduce a new layout for high-bandwidth single-mode selectively oxidized vertical-cavity surface-emitting laser (VCSEL) arrays operating at 980 nm or 850 nm emission wavelength for substrate or epitaxial side emission. Coplanar feeding lines and polyimide passivation are used to reduce electrical parasitics in top-emitting GaAs and bottom-emitting InGaAs VCSELs. In order to enhance fundamental single-mode emission for larger devices of reduced series resistance a surface relief transverse mode filter is employed. Fabricated VCSELs are applied in various interconnect schemes. In detail, we demonstrate 2.5 Gb/s pseudo-random data transmission with GaAs VCSELs at an emission wavelength of λ=835 nm over 120 µm core diameter step index plastic-optical fiber (POF) of 2.5 m length. InGaAs quantum-well based VCSELs at 935 nm emission wavelength are investigated for use in perfluorinated graded-index plastic-optical fiber (GI-POF) links. We obtain 7 Gb/s pseudo random bit sequence (PRBS) non-return-to-zero (NRZ) data transmission over 80 m long 155 µm diameter GI-POF. We investigate data transmission over standard 1300 nm, 9 µm core diameter single-mode fiber using selectively oxidized single-mode GaAs or InGaAs VCSELs. We achieve biased 3 Gb/s and bias-free 1 Gb/s pseudo-random data transmission over 4.3 km at 830 nm emission wavelength where a simple fiber mode filter is used to suppress intermodal dispersion caused by the second order fiber mode. For the first time, we demonstrate 12.5 Gb/s data rate transmission of PRBS signals over 100 m graded-index multimode fiber or 1 km single-mode fiber using high performance single-mode GaAs VCSELs of 12.3 GHz modulation bandwidth emitting at λ=850 nm. Longer-wavelength InGaAs VCSELs with emission at λ=1130 nm are used to transmit 2.5 Gb/s signals over 10 km of 9 µm standard fiber. For all data transmission experiments bit-error rates (BER) remain better than 10^-11 for transmission of PRBS signals for back-to-back (BTB) testing as well as for fiber transmission.

  11. Quantitative analysis of RNA-protein interactions on a massively parallel array for mapping biophysical and evolutionary landscapes

    PubMed Central

    Buenrostro, Jason D.; Chircus, Lauren M.; Araya, Carlos L.; Layton, Curtis J.; Chang, Howard Y.; Snyder, Michael P.; Greenleaf, William J.

    2015-01-01

    RNA-protein interactions drive fundamental biological processes and are targets for molecular engineering, yet quantitative and comprehensive understanding of the sequence determinants of affinity remains limited. Here we repurpose a high-throughput sequencing instrument to quantitatively measure binding and dissociation of MS2 coat protein to >10^7 RNA targets generated on a flow-cell surface by in situ transcription and inter-molecular tethering of RNA to DNA. We decompose the binding energy contributions from primary and secondary RNA structure, finding that differences in affinity are often driven by sequence-specific changes in association rates. By analyzing the biophysical constraints and modeling mutational paths describing the molecular evolution of MS2 from low- to high-affinity hairpins, we quantify widespread molecular epistasis, and a long-hypothesized structure-dependent preference for G:U base pairs over C:A intermediates in evolutionary trajectories. Our results suggest that quantitative analysis of RNA on a massively parallel array (RNA-MaP) can map relationships across molecular variants. PMID:24727714

  12. Infrared laser transillumination CT imaging system using parallel fiber arrays and optical switches for finger joint imaging

    NASA Astrophysics Data System (ADS)

    Sasaki, Yoshiaki; Emori, Ryota; Inage, Hiroki; Goto, Masaki; Takahashi, Ryo; Yuasa, Tetsuya; Taniguchi, Hiroshi; Devaraj, Balasigamani; Akatsuka, Takao

    2004-05-01

    The heterodyne detection technique, on which the coherent detection imaging (CDI) method is founded, can discriminate and select very weak, highly directional, forward-scattered, coherence-retaining photons that emerge from scattering media in spite of their complex and highly scattering nature. That property enables us to reconstruct tomographic images using the same reconstruction technique as X-ray CT, i.e., the filtered backprojection method. Our group has developed a transillumination laser CT imaging method based on the CDI method in the visible and near-infrared regions and reconstruction from projections, and has reported a variety of tomographic images of biological objects, both in vitro and in vivo, to demonstrate its effectiveness for biomedical use. Since the previous system was not optimized, it took several hours to obtain a single image. For practical use, we developed a prototype CDI-based imaging system using a parallel fiber array and optical switches to reduce the measurement time significantly. Here, we describe this prototype fiber-optic transillumination laser CT imaging system based on optical heterodyne detection for early diagnosis of rheumatoid arthritis (RA), demonstrating tomographic imaging of an acrylic phantom as well as the fundamental imaging properties. We expect that further refinements of the fiber-optic-based laser CT imaging system could lead to a novel and practical diagnostic tool for rheumatoid arthritis and other joint- and bone-related diseases in the human finger.

  13. Highly Parallel Computing Architectures by using Arrays of Quantum-dot Cellular Automata (QCA): Opportunities, Challenges, and Recent Results

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Toomarian, Benny N.

    2000-01-01

    There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years, and this trend may continue for the near future. However, it is well known that there are major obstacles, i.e., the physical limits of feature-size reduction and the ever-increasing cost of foundries, that would prevent the long-term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature-size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA-based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dot-based computing using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general, and QCA specifically, are intriguing to NASA due to their high packing density (10^11 - 10^12 per square cm), low power consumption (no transfer of current), and potentially higher radiation tolerance. Under the Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planar (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA-based architectures for highly parallel and systolic computation of signal/image processing applications, such as the FFT and the Wavelet and Walsh-Hadamard transforms.

  14. Scheduling on the MasPar SIMD parallel computer 

    E-print Network

    Perkins, Keith Douglas

    1995-01-01

    lowering communication costs by scheduling related tasks onto the same processor and assuming that processors may be dynamically allocated during task scheduling. Comparisons are made between serial and parallel versions of the task scheduler. The parallel...

  15. Implementing Access to Data Distributed on Many Processors

    NASA Technical Reports Server (NTRS)

    James, Mark

    2006-01-01

    A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization.

  16. Algorithmically specialized parallel computers

    SciTech Connect

    Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.

    1985-01-01

    This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.

  17. Architecture design of a FPGA-based wavefront processor for correlating a Shack-Hartmann sensor

    NASA Astrophysics Data System (ADS)

    Peng, Xiaofeng; Li, Mei; Rao, ChangHui

    2008-12-01

    During solar observation, atmospheric turbulence usually blurs the solar image coming from the telescope. In order to improve the quality of the solar image, a solar adaptive optics (AO) system is employed. In a typical solar AO system, a correlating Shack-Hartmann (SH) wavefront sensor is used to detect the aberration of the blurred image. To detect the aberration as well as possible, the frame rate of the CCD working behind the SH sensor must be fast enough to keep pace with the variation of the turbulence; CCDs with 1000 Hz frame rates are very common in solar adaptive optics systems. Furthermore, next-generation telescopes are so large that CCD resolution becomes higher and higher, so the wavefront processor requires a huge amount of processing power. As FPGA (Field Programmable Gate Array) technology becomes more powerful, it can provide substantial processing ability through high-speed, parallel processing. This paper presents a design for an FPGA-based wavefront processor in a solar adaptive optics system. It is characterized by a pipelined, parallel architecture. The peak operation speed is over 86 billion operations per second and the calculation latency is 7.04 us in a system with a 16×16 sub-aperture array, in which each sub-aperture is 16×16 pixels and the reference image is 8×8 pixels. Using this processor, the frame rate of the CCD can be up to 8800 fps. Built in a single FPGA, it is low-cost, compact and easy to upgrade.

  18. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    SciTech Connect

    Robert W. Numrich

    2008-04-22

    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a program to provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to extend the co-array model to other languages in a small experimental version of Co-array Python. Another collaborative project defined a Fortran 95 interface to ARMCI to encourage Fortran programmers to use the one-sided communication model in anticipation of their conversion to the co-array model later. A collaborative project with the Earth Sciences community at NASA Goddard and GFDL experimented with the co-array model within computational kernels related to their climate models, first using CafLib and then extending the co-array model to use design patterns. Future work will build on the design-pattern idea with a redesign of CafLib as a true object-oriented library using Fortran 2003 and as a parallel numerical library using Fortran 2008.

  19. Fault-tolerant computer architecture based on INMOS transputer processor

    NASA Technical Reports Server (NTRS)

    Ortiz, Jorge L.

    1987-01-01

    Redundant processing has been used for several years in mission flight systems. In these systems, more than one processor performs the same task at the same time, but only one processor is actually in use. A fault-tolerant computer architecture based on the features provided by INMOS Transputers is presented. The Transputer architecture provides several communication links that allow data and command communication with other Transputers without the use of a bus. Additionally, the Transputer allows the use of parallel processing to increase system speed considerably. The processor architecture consists of three processors working in parallel, keeping all the processors at the same operational level, with only one processor in actual control of the process. The design allows each Transputer to test the other two Transputers and report the operating condition of the neighboring processors. A graphic display was developed to facilitate the identification of any problem by the user.
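
    The three-Transputer scheme amounts to triple modular redundancy with mutual health checks. A minimal majority-vote sketch (the function name is illustrative, not from the paper):

        from collections import Counter

        def majority_vote(results):
            # results: outputs of the three redundant processors for one task.
            value, count = Counter(results).most_common(1)[0]
            if count >= 2:
                return value          # at least two processors agree
            raise RuntimeError("no majority: all three processors disagree")

        print(majority_vote([42, 42, 41]))   # -> 42; the faulty unit is outvoted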

  20. Opto-electronic morphological processor

    NASA Technical Reports Server (NTRS)

    Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

    1993-01-01

    The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.
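
    The morphological operations the chip implements in hardware (erosion, dilation, and compositions such as opening) are standard image operators. A software sketch using scipy.ndimage, for reference only; the invention realizes these with detectors, operators and modulators rather than software:

        import numpy as np
        from scipy.ndimage import binary_erosion, binary_dilation

        image = np.zeros((8, 8), dtype=bool)
        image[2:6, 2:6] = True                      # a 4x4 square of "on" pixels
        se = np.ones((3, 3), dtype=bool)            # 3x3 structuring element

        eroded  = binary_erosion(image, structure=se)    # shrinks the square
        dilated = binary_dilation(image, structure=se)   # grows the square
        opened  = binary_dilation(binary_erosion(image, structure=se),
                                  structure=se)          # erosion then dilation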

  1. High performance parallel computers for science: New developments at the Fermilab advanced computer program

    SciTech Connect

    Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

    1988-08-01

    Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs.

  2. Periodic parallel array of nanopillars and nanoholes resulting from colloidal stripes patterned by geometrically confined evaporative self-assembly for unique anisotropic wetting.

    PubMed

    Li, Xiangmeng; Wang, Chunhui; Shao, Jinyou; Ding, Yucheng; Tian, Hongmiao; Li, Xiangming; Wang, Li

    2014-11-26

    In this paper we present an economical process to create anisotropic microtextures based on periodic parallel stripes of monolayer silica nanoparticles (NPs) patterned by geometrically confined evaporative self-assembly (GCESA). In the GCESA process, a straight meniscus of a colloidal dispersion is initially formed in an opened enclosure, which is composed of two parallel plates bounded by a U-shaped spacer sidewall on three sides, with an evaporating outlet on the fourth side. Lateral evaporation of the colloidal dispersion leads to periodic "stick-slip" receding of the meniscus (evaporative front), as triggered by the "coffee-ring" effect, promoting the assembly of silica NPs into periodic parallel stripes. The morphology of the stripes can be well controlled by tailoring process variables such as substrate wettability, NP concentration, temperature, and gap height. Furthermore, arrayed patterns of nanopillars or nanoholes are generated on a silicon wafer using the as-prepared colloidal stripes as an etching mask or template. Such arrayed patterns reveal unique anisotropic wetting properties, exhibiting a large contact angle hysteresis viewed from both the parallel and perpendicular directions, in addition to a large wetting anisotropy. PMID:25353399

  3. SCAN secure processor and its biometric capabilities

    NASA Astrophysics Data System (ADS)

    Kannavara, Raghudeep; Mertoguno, Sukarno; Bourbakis, Nikolaos

    2011-04-01

    This paper presents the design of the SCAN secure processor and its extended instruction set to enable secure biometric authentication. The SCAN secure processor is a modified SparcV8 processor architecture with a new instruction set to handle voice, iris, and fingerprint-based biometric authentication. The algorithms for processing biometric data are based on the local global graph methodology. The biometric modules are synthesized in reconfigurable logic and the results of the field-programmable gate array (FPGA) synthesis are presented. We propose to implement the above-mentioned modules in an off-chip FPGA co-processor. Further, the SCAN secure processor will offer SCAN-based encryption and decryption of 32-bit instructions and data.

  4. Neurovision processor for designing intelligent sensors

    NASA Astrophysics Data System (ADS)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  5. Online track processor for the CDF upgrade

    SciTech Connect

    E. J. Thomson et al.

    2002-07-17

    A trigger track processor, called the eXtremely Fast Tracker (XFT), has been designed for the CDF upgrade. This processor identifies high transverse momentum (> 1.5 GeV/c) charged particles in the new central outer tracking chamber for CDF II. The XFT design is highly parallel to handle the input rate of 183 Gbits/s and output rate of 44 Gbits/s. The processor is pipelined and reports the result for a new event every 132 ns. The processor uses three stages: hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. The complete system has been installed and commissioned at CDF II. An overview of the track processor and performance in CDF Run II are presented.

  6. Reconfigurable VLSI architecture for a database processor

    SciTech Connect

    Oflazer, K.

    1983-01-01

    This work brings together the processing potential offered by regularly structured VLSI processing units and the architecture of a database processor: the relational associative processor (RAP). The main motivations are to integrate a RAP cell processor on a few VLSI chips and improve performance by employing procedures exploiting these VLSI chips and the system-level reconfigurability of processing resources. The resulting VLSI database processor consists of parallel processing cells that can be reconfigured into a large processor to execute the hard operations of projection and semijoin efficiently. It is shown that such a configuration can provide 2 to 3 orders of magnitude of performance improvement over previous implementations of the RAP system in the execution of such operations. 27 refs.

  7. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    SciTech Connect

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee; Buckner, Mark A

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256×256. The system clock is 125 MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the product of five years of sustained, intensive R&D collaboration (involving over $400M of investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
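
    The quoted peak rate follows directly from the matrix size and the clock: a 256×256 matrix-vector product is 65,536 multiply-and-add pairs, i.e. 128K operations per cycle, and at 125 MHz that gives roughly 16 TeraOPS, as this quick check shows:

        ops_per_cycle = 256 * 256 * 2        # one multiply and one add per matrix element
        clock_hz = 125e6
        print(ops_per_cycle)                 # 131072, i.e. 128K operations per cycle
        print(ops_per_cycle * clock_hz)      # ~1.64e13, about 16 TeraOPS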

  8. Processor-Group Aware Runtime Support for Shared-and Global-Address Space Models

    SciTech Connect

    Krishnan, Manoj Kumar; Tipparaju, Vinod; Palmer, Bruce; Nieplocha, Jarek

    2004-12-07

    Exploiting multilevel parallelism using processor groups is becoming increasingly important for programming on high-end systems. This paper describes group-aware run-time support for shared-/global-address space programming models. The current effort has been undertaken in the context of the Aggregate Remote Memory Copy Interface (ARMCI) [5], a portable runtime system used as a communication layer for Global Arrays [6], Co-Array Fortran (CAF) [9], GPSHMEM [10], Co-Array Python [11], and also end-user applications. The paper describes the management of shared memory, integration of shared memory communication and RDMA on clusters with SMP nodes, and registration. These are all required for efficient multi-method and multi-protocol communication on modern systems. Focus is placed on techniques for supporting process groups while maximizing communication performance and efficiently managing global memory system-wide.
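
    The processor-group idea, splitting the world set of processes into subgroups that run collectives independently, can be illustrated with MPI communicator splitting via mpi4py; this is an analogy to the concept, not the ARMCI API itself:

        from mpi4py import MPI

        world = MPI.COMM_WORLD
        # Split processes into two groups by rank parity; each subgroup
        # then performs its own independent collective operations.
        color = world.Get_rank() % 2
        group_comm = world.Split(color=color, key=world.Get_rank())
        group_sum = group_comm.allreduce(world.Get_rank(), op=MPI.SUM)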

  9. Algorithmically Specialized Parallel Architecture For Robotics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    Computing system called Robot Mathematics Processor (RMP) contains large number of processor elements (PE's) connected in various parallel and serial combinations reconfigurable via software. Special-purpose architecture designed for solving diverse computational problems in robot control, simulation, trajectory generation, workspace analysis, and like. System an MIMD-SIMD parallel architecture capable of exploiting parallelism in different forms and at several computational levels. Major advantage lies in design of cells, which provides flexibility and reconfigurability superior to previous SIMD processors.

  10. Reconfigurable computer array: The bridge between high speed sensors and low speed computing

    SciTech Connect

    Robinson, S.H.; Caffrey, M.P.; Dunham, M.E.

    1998-06-16

    A universal limitation of RF and imaging front-end sensors is that they easily produce data at a higher rate than any general-purpose computer can continuously handle. Therefore, Los Alamos National Laboratory has developed a custom Reconfigurable Computing Array board to support a large variety of processing applications including wideband RF signals, LIDAR and multi-dimensional imaging. The board's design exploits three key features to achieve its performance. First, there are large banks of fast memory dedicated to each reconfigurable processor and also shared between pairs of processors. Second, there are dedicated data paths between processors, and from a processor to flexible I/O interfaces. Third, the design provides the ability to link multiple boards into a serial and/or parallel structure.

  11. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm³) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm²) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
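
    The underlying problem in the linear-array case is partitioning a chain of m modules into n contiguous groups so that the most heavily loaded processor is as light as possible. A straightforward O(nm²) dynamic program is sketched below; the paper's contribution is the much faster O(nm log m) method, which this sketch does not reproduce:

        def min_bottleneck(weights, n):
            # weights[i]: load of module i; n: number of processors in the array.
            m = len(weights)
            prefix = [0] * (m + 1)
            for i, w in enumerate(weights):
                prefix[i + 1] = prefix[i] + w
            INF = float("inf")
            # f[p][j]: best achievable bottleneck when the first j modules are
            # assigned to p processors, each taking a contiguous block.
            f = [[INF] * (m + 1) for _ in range(n + 1)]
            f[0][0] = 0
            for p in range(1, n + 1):
                for j in range(1, m + 1):
                    for k in range(j):
                        load = prefix[j] - prefix[k]   # modules k..j-1 on processor p
                        f[p][j] = min(f[p][j], max(f[p - 1][k], load))
            return f[n][m]

        print(min_bottleneck([4, 2, 7, 1, 5, 3], 3))   # -> 8, e.g. [4,2] [7,1] [5,3]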

  12. Scioto: A Framework for Global-ViewTask Parallelism

    SciTech Connect

    Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

    2008-09-09

    We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality-aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.

  13. Parallel asynchronous hardware implementation of image processing algorithms

    NASA Technical Reports Server (NTRS)

    Coon, Darryl D.; Perera, A. G. U.

    1990-01-01

    Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing stage. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.

  14. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad (Rochester, MN)

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  15. Load Balancing in Processor Sharing Systems Eitan Altman (INRIA)

    E-print Network

    Ayesta, Urtzi

    Load Balancing in Processor Sharing Systems. Eitan Altman (INRIA), Urtzi Ayesta (LAAS). Downloads progress in parallel under Processor Sharing (PS) at each server; what is the optimal routing policy? Routing decisions may be taken by the central unit or by the downloader: a centralized setting, in which a dispatcher takes the decisions, versus a decentralized setting.

  16. Transitive closure on the imagine stream processor

    SciTech Connect

    Griem, Gorden; Oliker, Leonid

    2003-11-11

    The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that TC achieves only limited performance, primarily due to the complicated data dependencies of the blocked algorithm. This work is part of an ongoing effort to identify classes of scientific problems well suited for streaming processors.
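
    For reference, the computation being streamed is Warshall-style transitive closure over a Boolean adjacency matrix. A dense software sketch follows; the Imagine implementation tiles this into stream batches, which the sketch omits:

        import numpy as np

        def transitive_closure(adj):
            # adj: boolean adjacency matrix; returns the reachability matrix.
            reach = adj.copy()
            n = reach.shape[0]
            for k in range(n):
                # Any path i -> k combined with k -> j makes j reachable from i.
                reach |= np.outer(reach[:, k], reach[k, :])
            return reach

        a = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]], dtype=bool)
        print(transitive_closure(a).astype(int))   # the edge 0 -> 2 appears via node 1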

  17. Architectures for reasoning in parallel

    NASA Technical Reports Server (NTRS)

    Hall, Lawrence O.

    1989-01-01

    The research conducted has dealt with rule-based expert systems, investigating algorithms that may lead to their effective parallelization. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms was also researched. Two experimental vehicles were developed to facilitate this research: Backpac, a parallel backward chained rule-based reasoning system, and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying future to a function call causes that call to be evaluated as a task running in parallel with the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32-processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10-processor Encore and the Concert with partitions of 32 or fewer processors. Additionally, experiments have been run with a stripped-down version of EMYCIN.
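
    Multilisp's future construct returns immediately with a placeholder while the wrapped computation proceeds in parallel. Python's concurrent.futures offers the closest standard-library analogue; the rule-firing function below is an illustrative stand-in, not code from the systems described:

        from concurrent.futures import ThreadPoolExecutor

        def fire_rule(rule, facts):
            # Stand-in for evaluating one rule against the fact base.
            return [f for f in facts if rule(f)]

        facts = list(range(10))
        rules = [lambda f: f % 2 == 0, lambda f: f > 6]

        with ThreadPoolExecutor() as pool:
            # Each submit() is analogous to wrapping a call in `future`:
            # it spawns a parallel task and returns a placeholder at once.
            futures = [pool.submit(fire_rule, r, facts) for r in rules]
            # Reading .result() corresponds to forcing the future's value.
            results = [f.result() for f in futures]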

  18. Parallel processing of natural language

    SciTech Connect

    Chang, H.O.

    1986-01-01

    Two types of parallel natural language processing are studied in this work: (1) the parallelism between syntactic and nonsyntactic processing and (2) the parallelism within syntactic processing. It is recognized that a syntactic category can potentially be attached to more than one node in the syntactic tree of a sentence. Even if all the attachments are syntactically well-formed, nonsyntactic factors such as semantic and pragmatic considerations may require one particular attachment. Syntactic processing must synchronize and communicate with nonsyntactic processing. Two syntactic processing algorithms are proposed for use in a parallel environment: Earley's algorithm and the LR(k) algorithm. Conditions are identified to detect syntactic ambiguity and the algorithms are augmented accordingly. It is shown that by using nonsyntactic information during syntactic processing, backtracking can be reduced and the performance of the syntactic processor improved. For the second type of parallelism, it is recognized that one portion of a grammar can be isolated from the rest of the grammar and processed by a separate processor. A partial grammar of a larger grammar is defined. Parallel syntactic processing is achieved by using two processors concurrently: the main processor (mp) and the auxiliary processor (ap).

  19. COMPUTER ARCHITECTURE WITH ASSOCIATIVE PROCESSOR REPLACING LAST

    E-print Network

    Ginosar, Ran

    Machine learning, data mining, network routing, search engines and other big data applications demand large-scale data storage and processing. The proposed architecture replaces the last-level cache with an associative processor, which functions as a parallel SIMD processor and a memory at the same time. Related architectures include vector, or SIMD, coprocessors [1][16][24]. However, data transfer between ...

  20. Parallel Earley's parser and its application to syntactic image analysis

    SciTech Connect

    Chiang, Y.P.; Fu, K.S.

    1983-01-01

    A complete Earley parser which includes recognition and parse extraction has been implemented on a triangular array of processors. A detailed analysis of the complete parser is given. The recognition algorithm is executed in parallel by adopting a new operator, x*, and restricting the input context-free grammar to be lambda-free. The parse extraction algorithm which follows recognition uses a nonrecursive subroutine to generate the correct right-parse in parallel. A special busing arrangement within this array enables the right data to reach the right place at the right time. Simulation examples are provided. The results show that when a string of length n is under test, the correct right-parse is obtained at system time 2n + 1 if the string is accepted. 15 references.
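
    A sequential Earley recognizer makes the parallelized structure easier to see: state set i holds the items alive after reading i tokens, and the predictor/scanner/completer steps below are what the triangular array evaluates concurrently. A compact sketch for lambda-free grammars, matching the record's restriction (names are illustrative):

        def earley_recognize(grammar, start, tokens):
            # grammar: dict nonterminal -> list of right-hand-side tuples (lambda-free).
            # states[i] holds Earley items (lhs, rhs, dot, origin) after i tokens.
            states = [set() for _ in range(len(tokens) + 1)]
            states[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
            for i in range(len(tokens) + 1):
                worklist = list(states[i])
                while worklist:
                    lhs, rhs, dot, origin = worklist.pop()
                    if dot < len(rhs):
                        sym = rhs[dot]
                        if sym in grammar:                          # predictor
                            for alt in grammar[sym]:
                                item = (sym, alt, 0, i)
                                if item not in states[i]:
                                    states[i].add(item)
                                    worklist.append(item)
                        elif i < len(tokens) and tokens[i] == sym:  # scanner
                            states[i + 1].add((lhs, rhs, dot + 1, origin))
                    else:                                           # completer
                        # Lambda-freeness guarantees origin < i here, so
                        # states[origin] is already final when we read it.
                        for l2, r2, d2, o2 in list(states[origin]):
                            if d2 < len(r2) and r2[d2] == lhs:
                                item = (l2, r2, d2 + 1, o2)
                                if item not in states[i]:
                                    states[i].add(item)
                                    worklist.append(item)
            return any(lhs == start and dot == len(rhs) and origin == 0
                       for (lhs, rhs, dot, origin) in states[len(tokens)])

        grammar = {"S": [("S", "+", "T"), ("T",)], "T": [("a",)]}
        print(earley_recognize(grammar, "S", ["a", "+", "a"]))   # True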

  1. Model-driven mapping onto distributed memory parallel computers

    NASA Technical Reports Server (NTRS)

    Sussman, Alan

    1992-01-01

    The author addresses the problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine in the context of building a mapping compiler for a distributed memory parallel machine. He demonstrates the effectiveness of using execution models to select the best mapping technique from among those available for a given program segment on a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs on a linear processor array, it is shown that selecting the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from a mapping compiler for the Warp systolic array machine show that the execution models considered are accurate enough to select the best mapping technique for a given program.

  2. Design and parallel fabrication of wire-grid polarization arrays for polarization-resolved imaging at 1.55 μm

    E-print Network

    Klotzkin, David

    Arrays of small, orthogonal wire-grid polarizers (WGPs) that can be matched to individual detector pixels are described. Devices applying liquid crystal material to patterned gold domains for visible imaging [4] have been fabricated previously. Here, a polarization wire grid suitable for integration with a detector array at a wavelength of 1.55 μm is fabricated ...

  3. Design and microfabrication of a high-aspect-ratio PDMS microbeam array for parallel nanonewton force measurement and protein printing

    NASA Astrophysics Data System (ADS)

    Sasoglu, F. M.; Bohl, A. J.; Layton, B. E.

    2007-03-01

    Cell and protein mechanics has applications ranging from cellular development to tissue engineering. Techniques such as magnetic tweezers, optical tweezers and atomic force microscopy have been used to measure cell deformation forces of the order of piconewtons to nanonewtons. In this study, an array of polymeric polydimethylsiloxane (PDMS) microbeams with diameters of 10-40 µm and lengths of 118 µm was fabricated from Sylgard® with curing agent concentrations ranging from 5% to 20%. The resulting spring constants were 100-300 nN µm⁻¹. The elastic modulus of PDMS was determined experimentally at different curing agent concentrations and found to be 346 kPa to 704 kPa in a millimeter-scale array and ~1 MPa in a microbeam array. Additionally, the microbeam array was used to print laminin for the purpose of cell adhesion. Linear and nonlinear finite element analyses are presented and compared to the closed-form solution. The highly compliant, transparent, biocompatible PDMS may offer a method for more rapid throughput in cell and protein mechanics force measurement experiments, with sensitivities necessary for highly compliant structures such as axons.
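
    The reported spring constants are consistent with simple Euler-Bernoulli beam theory: for a cylindrical cantilever of diameter d, length L and modulus E, the tip stiffness is k = 3EI/L³ with I = πd⁴/64. A quick check using upper-end values assumed from the paper (E ≈ 1 MPa, d = 40 µm, L = 118 µm):

        import math

        E = 1.0e6        # elastic modulus of PDMS in the microbeam array, Pa (~1 MPa)
        d = 40e-6        # beam diameter, m
        L = 118e-6       # beam length, m

        I = math.pi * d**4 / 64      # second moment of area, circular cross-section
        k = 3 * E * I / L**3         # stiffness of an end-loaded cantilever, N/m
        print(k)                     # ~0.23 N/m
        print(k * 1e3)               # ~230 nN/um, within the reported 100-300 nN/um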

  4. Task assignment in parallel processor systems 

    E-print Network

    Manoharan, Sathiamoorthy

    1993-01-01

    A generic object-oriented simulation platform is developed in order to conduct experiments on the performance of assignment schemes. The simulation platform, called Genesis, is generic in the sense that it can model the ...

  5. SUDS : automatic parallelization for raw processors

    E-print Network

    Frank, Matthew I

    2003-01-01

    A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind ...

  6. Appears in the Proceedings of the 2002 International Conference on Computer Design. The Imagine Stream Processor

    E-print Network

    Owens, John

    The Imagine Stream Processor is a single-chip programmable media processor with 48 parallel ALUs. At 400 MHz, this single-chip processor must issue up to 48 instructions/cycle and provide up to 144 words/cycle of data.

  7. Sandia secure processor : a native Java processor.

    SciTech Connect

    Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

    2003-08-01

    The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and for compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements, with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

  8. Parallel Algorithms for Computer Vision on the Connection Machine

    E-print Network

    Little, James J.

    1986-11-01

    The Connection Machine is a fine-grained parallel computer having up to 64K processors. It supports both local communication among the processors, which are situated in a two-dimensional mesh, and high-bandwidth ...

  9. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing

    PubMed Central

    Park, Hansoo; Kim, Jong-Il; Ju, Young Seok; Gokcumen, Omer; Mills, Ryan E; Kim, Sheehyun; Lee, Seungbok; Suh, Dongwhan; Hong, Dongwan; Kang, Hyunseok Peter; Yoo, Yun Joo; Shin, Jong-Yeon; Kim, Hyun-Jin; Yavartanoo, Maryam; Chang, Young Wha; Ha, Jung-Sook; Chong, Wilson; Hwang, Ga-Ram; Darvishi, Katayoon; Kim, HyeRan; Yang, Song Ju; Yang, Kap-Seok; Kim, Hyungtae; Hurles, Matthew E; Scherer, Stephen W; Carter, Nigel P; Tyler-Smith, Chris; Lee, Charles; Seo, Jeong-Sun

    2012-01-01

    Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3× coverage) and two Asian genomes (AK1, with 27.8× coverage and AK2, with 32.0× coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine. PMID:20364138

  10. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, D.B.

    1994-07-19

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

  11. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, Dario B. (DeSoto, TX)

    1994-01-01

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

  12. A Parallel Tree Code

    E-print Network

    John Dubinski

    1996-03-18

    We describe a new implementation of a parallel N-body tree code. The code is load-balanced using the method of orthogonal recursive bisection to subdivide the N-body system into independent rectangular volumes, each of which is mapped to a processor on a parallel computer. On the Cray T3D, the load balance is in the range of 70-90%, depending on the problem size and number of processors. The code can handle simulations with more than 10 million particles, roughly a factor of 10 more than vectorized tree codes allow.
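
    Orthogonal recursive bisection splits the particle set at the median coordinate, alternating axes, until one rectangular volume remains per processor. A minimal sketch (assumes the processor count is a power of two; names are illustrative):

        import numpy as np

        def orb(positions, index, n_proc, axis=0):
            # Recursively bisect the particles (rows of `positions`) among
            # n_proc processors; returns one index array per processor.
            if n_proc == 1:
                return [index]
            order = index[np.argsort(positions[index, axis])]
            half = len(order) // 2                 # median split balances the load
            next_axis = (axis + 1) % positions.shape[1]
            return (orb(positions, order[:half], n_proc // 2, next_axis) +
                    orb(positions, order[half:], n_proc // 2, next_axis))

        pos = np.random.rand(1000, 3)
        domains = orb(pos, np.arange(len(pos)), 8)   # 8 processors, ~125 particles each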

  13. Online track processor for the CDF upgrade

    SciTech Connect

    Ciobanu, C.; Gertenslager, J.; Hoftiezer, J.

    1999-08-01

    A trigger track processor is being designed for the CDF upgrade. This processor identifies high momentum (P{sub T} > 1.5 GeV/c) charged tracks in the new central outer tracking chamber for CDF II. The track processor is called the Extremely Fast Tracker (XFT). The XFT design is highly parallel to handle the input rate of 183 Gbits/sec and output rate of 44 Gbits/sec. The processor is pipelined and reports the results for a new event every 132 ns. The processor uses three stages, hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow for in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. Prototypes of each of these modules have been designed and built, and are presently undergoing testing. An overview of the track processor and results of testing are presented.

  14. An Experimental Digital Image Processor

    NASA Astrophysics Data System (ADS)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
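
    Unsharp masking, the edge-enhancement method named in the abstract, adds a scaled difference between the image and a blurred copy back to the image. A software sketch, with an assumed Gaussian blur standing in for the hardware's filter:

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def unsharp_mask(image, sigma=2.0, amount=1.0):
            # Blur, take the high-frequency residual, and add it back scaled.
            blurred = gaussian_filter(image.astype(float), sigma=sigma)
            return image + amount * (image - blurred)

        img = np.random.rand(64, 64)
        sharpened = unsharp_mask(img, sigma=2.0, amount=0.8)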

  15. SPROC: A multiple-processor DSP IC

    NASA Technical Reports Server (NTRS)

    Davis, R.

    1991-01-01

    A large, single-chip, multiple-processor, digital signal processing (DSP) integrated circuit (IC) fabricated in HP-Cmos34 is presented. The innovative architecture is best suited for analog and real-time systems characterized by both parallel signal data flows and concurrent logic processing. The IC is supported by a powerful development system that transforms graphical signal flow graphs into production-ready systems in minutes. Automatic compiler partitioning of tasks among four on-chip processors gives the IC the signal processing power of several conventional DSP chips.

  16. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling as the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference: Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195.

  17. Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics

    PubMed Central

    Shao, Yiping; Sun, Xishan; Lan, Kejian A.; Bircher, Chad; Lou, Kai; Deng, Zhi

    2014-01-01

    In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays for measuring the depth-of-interaction (DOI) and improving imaging performance. Each PET detector has an 8×8 array of about 1.9×1.9×30.0 mm3 lutetium-yttrium-oxyorthosilicate (LYSO) scintillators, with each end optically connected to a SSPM array (16-channel in a 4×4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3×3 mm2, and its output is read by a custom-developed application-specific-integrated-circuit (ASIC) to directly convert analog signals to digital timing pulses that encode the interaction information. These pulses are transferred to and decoded by a field-programmable-gate-array (FPGA) based time-to-digital convertor for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable using flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered-subset-expectation-maximization image reconstruction was implemented. The measured mean energy, coincidence timing, and DOI resolution for a crystal were about 17.6%, 2.8 ns, and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI. This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner. PMID:24556629

  18. Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics

    NASA Astrophysics Data System (ADS)

    Shao, Yiping; Sun, Xishan; Lan, Kejian A.; Bircher, Chad; Lou, Kai; Deng, Zhi

    2014-03-01

    In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays to measure the depth of interaction (DOI) and improve imaging performance. Each PET detector has an 8 × 8 array of about 1.9 × 1.9 × 30.0 mm3 lutetium-yttrium-oxyorthosilicate scintillators, with each end optically connected to an SSPM array (16 channels in a 4 × 4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3 × 3 mm2, and its output is read by a custom-developed application-specific integrated circuit to directly convert analogue signals to digital timing pulses that encode the interaction information. These pulses are transferred to and are decoded by a field-programmable gate array-based time-to-digital convertor for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable the use of flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered subset expectation maximization image reconstruction was implemented. The measured mean energy, coincidence timing and DOI resolution for a crystal were about 17.6%, 2.8 ns and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI. This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner.

  19. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-03-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quantums are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

  20. Rapid, Single-Molecule Assays in Nano/Micro-Fluidic Chips with Arrays of Closely Spaced Parallel Channels Fabricated by Femtosecond Laser Machining

    PubMed Central

    Canfield, Brian K.; King, Jason K.; Robinson, William N.; Hofmeister, William H.; Davis, Lloyd M.

    2014-01-01

    Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

  1. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, M.S.; Strip, D.R.

    1996-01-30

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

  2. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, Michael S. (Ridgefield, CT); Strip, David R. (Albuquerque, NM)

    1996-01-01

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
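
    The d-edge structure pairs one directed edge with exactly one incident face, which is what lets each processor hold its piece of the model independently. A minimal sketch of such a structure (field names are illustrative, not taken from the patent):

        from dataclasses import dataclass

        Vertex = tuple  # (x, y, z) coordinates

        @dataclass(frozen=True)
        class DEdge:
            # A directed edge tied to exactly one face of the solid model.
            tail: Vertex      # description of the edge's start vertex
            head: Vertex      # description of the edge's end vertex
            face: int         # identifier of the single face this d-edge bounds

        # A unit triangle contributes three d-edges for its one face (face 0);
        # each d-edge could live on a different processor, with no
        # processor-to-processor intercommunication needed to store it.
        tri = [DEdge((0, 0, 0), (1, 0, 0), 0),
               DEdge((1, 0, 0), (0, 1, 0), 0),
               DEdge((0, 1, 0), (0, 0, 0), 0)]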

  3. Parallel nearest neighbor calculations

    NASA Astrophysics Data System (ADS)

    Trease, Harold

    We are just starting to parallelize the nearest neighbor portion of our free-Lagrange code. Our implementation of the nearest neighbor reconnection algorithm has not been parallelizable (i.e., we just flip one connection at a time). In this paper we consider what sort of nearest neighbor algorithms lend themselves to being parallelized. For example, the construction of the Voronoi mesh can be parallelized, but the construction of the Delaunay mesh (dual to the Voronoi mesh) cannot because of degenerate connections. We will show our most recent attempt to tessellate space with triangles or tetrahedrons with a new nearest neighbor construction algorithm called DAM (Dial-A-Mesh). This method has the characteristics of a parallel algorithm and produces a better tessellation of space than the Delaunay mesh. Parallel processing is becoming an everyday reality for us at Los Alamos. Our current production machines are Cray YMPs with 8 processors that can run independently or combined to work on one job. We are also exploring massive parallelism through the use of two 64K processor Connection Machines (CM2), where all the processors run in lock step mode. The effective application of 3-D computer models requires the use of parallel processing to achieve reasonable "turn around" times for our calculations.

  4. Semi-automated alignment and quantification of peaks using parallel factor analysis for comprehensive two-dimensional liquid chromatography-diode array detector data sets

    PubMed Central

    Allen, Robert C.

    2012-01-01

    Parallel factor analysis was used to quantify the relative concentrations of peaks within four-way comprehensive two dimensional liquid chromatography-diode array detector data sets. Since parallel factor analysis requires that the retention times of peaks between each injection are reproducible, a semi-automated alignment method was developed that utilizes the spectra of the compounds to independently align the peaks without the need for a reference injection. Peak alignment is achieved by shifting the optimized chromatographic component profiles from a three-way parallel factor analysis model applied to each injection. To ensure accurate shifting, components are matched up based on their spectral signature and the position of the peak in both chromatographic dimensions. The degree of shift, for each peak, is determined by calculating the distance between the median data point of the respective dimension (in either the second or first chromatographic dimension) and the maximum data point of the peak furthest from the median. All peaks that were matched to this peak are then aligned to this common retention data point. Target analyte recoveries for four simulated data sets were within 2 % of 100 % recovery in all cases. Two different experimental data sets were also evaluated. Precision of quantification of two spectrally similar and partially coeluting peaks present in urine was as good as or better than 4 %. Good results were also obtained for a challenging analysis of phenytoin in waste water effluent, where the results of the semi-automated alignment method agreed with the reference LC-LCMS/MS method within the precision of the methods. PMID:22444567
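
    A much-simplified sketch of the shift-to-a-common-retention-point idea in Python/NumPy; the median-of-maxima rule below is a stand-in for the paper's median-data-point criterion, and the profile shapes are synthetic.

        import numpy as np

        def align_profiles(profiles):
            """Shift each 1-D chromatographic component profile so that all
            peak maxima land on a common retention data point (a simplified
            stand-in for the median-based shifting rule in the abstract)."""
            profiles = [np.asarray(p) for p in profiles]
            target = int(np.median([np.argmax(p) for p in profiles]))
            return [np.roll(p, target - int(np.argmax(p))) for p in profiles]

        # Example: three copies of the same Gaussian peak at drifting times.
        t = np.arange(100)
        peaks = [np.exp(-0.05 * (t - c) ** 2) for c in (40, 45, 52)]
        aligned = align_profiles(peaks)
        assert all(int(np.argmax(p)) == int(np.argmax(aligned[0])) for p in aligned)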

  5. Simulation of an array-based neural net model

    NASA Technical Reports Server (NTRS)

    Barnden, John A.

    1987-01-01

    Research in cognitive science suggests that much of cognition involves the rapid manipulation of complex data structures. However, it is very unclear how this could be realized in neural networks or connectionist systems. A core question is: how could the interconnectivity of items in an abstract-level data structure be neurally encoded? The answer appeals mainly to positional relationships between activity patterns within neural arrays, rather than directly to neural connections in the traditional way. The new method was initially devised to account for abstract symbolic data structures, but it also supports cognitively useful spatial analogue, image-like representations. As the neural model is based on massive, uniform, parallel computations over 2D arrays, the massively parallel processor is a convenient tool for simulation work, although there are complications in using the machine to the fullest advantage. An MPP Pascal simulation program for a small pilot version of the model is running.

  6. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, and a computation involving the regular section A, the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.
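
    A brute-force Python oracle for the access pattern described above; the paper's contribution is to generate the same sequence with a finite state machine of at most k states instead of scanning every global index, so this sketch only defines what that machine must reproduce.

        def local_accesses(lo, hi, stride, p, k, me):
            """Local memory addresses touched by processor `me` for the
            regular section A[lo:hi:stride] of an array with a cyclic(k)
            distribution (block-cyclic, block size k) over p processors."""
            out = []
            for g in range(lo, hi, stride):
                block = g // k
                if block % p == me:                      # this processor owns g
                    out.append((block // p) * k + g % k) # its local address
            return out

        addrs = local_accesses(0, 64, 3, p=4, k=5, me=1)
        deltas = [b - a for a, b in zip(addrs, addrs[1:])]
        # The delta sequence is eventually periodic with period <= k,
        # which is exactly the regularity the k-state machine exploits.
        print(addrs, deltas)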

  7. Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies

    E-print Network

    Bunde, David

    Dedicated to the memory of Steve Seiden, who was killed in a tragic cycling accident on June 11, 2002. Abstract: This paper describes resource-allocation strategies to achieve processor locality for parallel jobs in Cplant and other supercomputers. Users of Cplant and other Sandia supercomputers submit parallel jobs to a job queue. When a job is scheduled to run, it is assigned to a set of processors

  8. Processor Allocation on Cplant: Achieving General Processor Locality Using OneDimensional Allocation Strategies

    E-print Network

    Bender, Michael

    Dedicated to the memory of Steve Seiden, who was killed in a tragic cycling accident on June 11, 2002. Abstract: This paper describes resource-allocation strategies to achieve processor locality for parallel jobs in Cplant and other supercomputers. Users of Cplant and other Sandia supercomputers submit parallel jobs to a job queue. When a job is scheduled to run, it is assigned to a set of processors

  9. Stochastic propagation of an array of parallel cracks: Exploratory work on matrix fatigue damage in composite laminates

    SciTech Connect

    Williford, R.E.

    1989-09-01

    Transverse cracking of polymeric matrix materials is an important fatigue damage mechanism in continuous-fiber composite laminates. The propagation of an array of these cracks is a stochastic problem usually treated by Monte Carlo methods. However, this exploratory work proposes an alternative approach wherein the Monte Carlo method is replaced by a more closed-form recursion relation based on "fractional Brownian motion." A fractal scaling equation is also proposed as a substitute for the more empirical Paris equation describing individual crack growth in this approach. Preliminary calculations indicate that the new recursion relation is capable of reproducing the primary features of transverse matrix fatigue cracking behavior. Although not yet fully tested or verified, this recursion relation may eventually be useful for real-time applications such as monitoring damage in aircraft structures.

  10. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    PubMed

    Martinez, Hector; Tarraga, Joaquin; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquin; Quintana-Orti, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, [Formula: see text], is an open-source application (available at http://www.opencb.org) that exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of [Formula: see text] on RNA reads of 100-400 nucleotides, which excels in execution time and sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR. PMID:26451814
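
    The aligner's internal data structures are not spelled out in the abstract; the toy Python fragment below merely shows why a suffix array makes exact read placement fast. The naive construction and the reference string are illustrative assumptions, not the software's actual implementation.

        import bisect

        def suffix_array(text):
            """Naive O(n^2 log n) build; real aligners use linear-time methods."""
            return sorted(range(len(text)), key=lambda i: text[i:])

        def locate(text, sa, read):
            """All positions where `read` occurs, found by binary search over
            the lexicographically sorted suffixes."""
            prefixes = [text[i:i + len(read)] for i in sa]
            lo = bisect.bisect_left(prefixes, read)
            hi = bisect.bisect_right(prefixes, read)
            return sorted(sa[lo:hi])

        ref = "ACGTACGTGACG"
        sa = suffix_array(ref)
        print(locate(ref, sa, "ACG"))   # -> [0, 4, 9]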

  11. Fabrication and Evaluation of a Micro(Bio)Sensor Array Chip for Multiple Parallel Measurements of Important Cell Biomarkers

    PubMed Central

    Pemberton, Roy M.; Cox, Timothy; Tuffin, Rachel; Drago, Guido A.; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C.; Davies, Rhodri; Jackson, Simon K.; Kenna, Gerry; Luxton, Richard; Hart, John P.

    2014-01-01

    This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

  12. Programmable DNA-mediated multitasking processor

    E-print Network

    Shu, Jian-Jun; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-01-01

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeats, and rigidity, programmable DNA-mediated processing is a promising computing paradigm that employs DNAs as information-storing and information-processing substrates to tackle computational problems. The massive parallelism of DNA hybridization exhibits transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as conducting massive data storage and simultaneous processing with far less material than conventional silicon devices.

  13. Programmable DNA-Mediated Multitasking Processor.

    PubMed

    Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-04-30

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeats, and rigidity, programmable DNA-mediated processing is a promising computing paradigm that employs DNAs as information-storing and information-processing substrates to tackle computational problems. The massive parallelism of DNA hybridization exhibits transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as conducting massive data storage and simultaneous processing with far less material than conventional silicon devices. PMID:25874653

  14. Parallel fault-tolerant robot control

    NASA Astrophysics Data System (ADS)

    Hamilton, Deirdre L.; Bennett, John K.; Walker, Ian D.

    1992-11-01

    Most robot controllers today employ a single processor architecture. As robot control requirements become more complex, these serial controllers have difficulty providing the desired response time. Additionally, with robots being used in environments that are hazardous or inaccessible to humans, fault-tolerant robotic systems are particularly desirable. A uniprocessor control architecture cannot offer tolerance of processor faults. Use of multiple processors for robot control offers two advantages over single processor systems. Parallel control provides a faster response, which in turn allows a finer granularity of control. Processor fault tolerance is also made possible by the existence of multiple processors. There is a trade-off between performance and the level of fault tolerance provided. This paper describes a shared memory multiprocessor robot controller that is capable of providing high performance and processor fault tolerance. We evaluate the performance of this controller, and demonstrate how performance and processor fault tolerance can be balanced in a cost-effective manner.

  15. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio (Los Alamos, NM); Feng, Wu-chun (Los Alamos, NM)

    2006-01-31

    A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
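
    A schematic of one computation interval plus strobe, written in Python with mpi4py (an assumption of this sketch; the patent does not prescribe MPI): outgoing traffic is only buffered during the interval, and the strobe's global exchange tells every processor its incoming message counts before the data itself moves.

        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        # During the computation interval, outgoing messages are only buffered.
        outbox = {dst: [] for dst in range(size)}
        for item in range(10):                      # stand-in for real work
            outbox[(rank + item) % size].append((rank, item))

        # Strobe: a global exchange of control information tells every
        # processor how many messages it will receive next interval.
        counts = [len(outbox[dst]) for dst in range(size)]
        incoming = comm.alltoall(counts)            # incoming[src] = count from src

        # With the schedule known, the buffered data itself is exchanged.
        received = comm.alltoall([outbox[dst] for dst in range(size)])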

  16. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless test compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.

  17. Artificial intelligence in parallel

    SciTech Connect

    Waldrop, M.M.

    1984-08-10

    The current rage in the Artificial Intelligence (AI) community is parallelism: the idea is to build machines with many independent processors doing many things at once. The upshot is that about a dozen parallel machines are now under development for AI alone. As might be expected, the approaches are diverse yet there are a number of fundamental issues in common: granularity, topology, control, and algorithms.

  18. A structural approach to the photonic processor

    NASA Astrophysics Data System (ADS)

    Jackson, Deborah

    In the early 1990s, photonics, the confluence of electronics and optics technologies to improve net processing efficiency, was advanced to the highest priority ranking on the DoD critical technologies list. Currently, photonics is considered a high-leverage technology because it is believed that photonic processors could potentially circumvent the serial processor limitation, or von Neumann bottleneck, which limits the throughput capacity of most electronic processors. Indeed, realtime solutions to current military problems, such as highly accurate missile guidance, sensor fusion, automatic target recognition, automated guidance of remotely piloted vehicles, etc., are consistently crippled by information processing bottlenecks. Such bottlenecks are particularly endemic to image-formatted data bases. An image-formatted data base is defined as a data base where, besides the information contained in each pixel, there is also information imparted by the spatial relationship among the data in the pixels. Thus, in image data, variations in grey scale are used to define edges and corners. To extract the spatially imparted information, it is often necessary to compare N x N pixels in the input image with the N x N pixels in a model image; this process takes N^4 comparison calculations. As the demand for higher resolution imagery increases and N gets larger, it becomes increasingly more difficult to make the image comparisons in realtime. Currently, digital electronic processor designs are optimized for numerical processing, which is an intrinsically serial operation. It is this serial nature that causes the limitation; the photonic processor, which can be designed with a more parallel architecture, has potential for circumventing this bottleneck. It is, therefore, anticipated that the intrinsic parallelism of optics will enable the photonic processor to solve problems in realtime that were previously considered unsolvable or only marginally solvable.

  19. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  20. Integrated fuel processor development.

    SciTech Connect

    Ahmed, S.; Pereira, C.; Lee, S. H. D.; Krumpelt, M.

    2001-12-04

    The Department of Energy's Office of Advanced Automotive Technologies has been supporting the development of fuel-flexible fuel processors at Argonne National Laboratory. These fuel processors will enable fuel cell vehicles to operate on fuels available through the existing infrastructure. The constraints of on-board space and weight require that these fuel processors be designed to be compact and lightweight, while meeting the performance targets for efficiency and gas quality needed for the fuel cell. This paper discusses the performance of a prototype fuel processor that has been designed and fabricated to operate with liquid fuels, such as gasoline, ethanol, methanol, etc. Rated for a capacity of 10 kWe (one-fifth of that needed for a car), the prototype fuel processor integrates the unit operations (vaporization, heat exchange, etc.) and processes (reforming, water-gas shift, preferential oxidation reactions, etc.) necessary to produce the hydrogen-rich gas (reformate) that will fuel the polymer electrolyte fuel cell stacks. The fuel processor work is being complemented by analytical and fundamental research. With the ultimate objective of meeting on-board fuel processor goals, these studies include: modeling fuel cell systems to identify design and operating features; evaluating alternative fuel processing options; and developing appropriate catalysts and materials. Issues and outstanding challenges that need to be overcome in order to develop practical, on-board devices are discussed.

  1. Efficiency of parallel direct optimization.

    PubMed

    Janies, D A; Wheeler, W C

    2001-03-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. PMID:12240679

  2. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. ©2001 The Willi Hennig Society.

  3. Power processor design considerations for a solar electric propulsion spacecraft

    NASA Technical Reports Server (NTRS)

    Costogue, E. N.; Gardner, J. A.

    1974-01-01

    Propulsion power processor design options are described. The propulsion power processor generates the regulated dc voltages and currents from the solar array source of a solar electric propelled spacecraft. The power processor consists of 12 power supplies that provide the regulated voltages and currents necessary to power a 30-cm mercury ion thruster. The design options for processing unregulated solar array power and for generating the regulated power required by each supply are studied. The technical approaches utilized in the developed design and the technological limitations of the identified design options are discussed. Alternate approaches for delivering power to a number of mercury ion thrusters and methods of optimization are described. It was concluded that this power processor design should be considered for application in future solar electric propulsion missions.

  4. Parallel VLSI architecture emulation and the organization of APSA/MPP

    NASA Technical Reports Server (NTRS)

    Odonnell, John T.

    1987-01-01

    The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

  5. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon functions calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction of the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline, the prototypical non-data-parallel algorithm, can be constructed with ease. To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.
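
    Charon's actual API is not reproduced here; the mpi4py fragment below only illustrates the underlying idea of a legacy array and a distributed array existing side by side, with explicit mapping functions (here plain Scatterv/Gatherv) to switch between them during incremental conversion.

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

        n = 1000
        legacy = np.arange(n, dtype='d') if rank == 0 else None  # non-distributed

        # Mapping: contiguous blocks, the remainder spread over low ranks.
        counts = [n // size + (1 if r < n % size else 0) for r in range(size)]
        displs = [sum(counts[:r]) for r in range(size)]
        local = np.empty(counts[rank], dtype='d')                # distributed view

        # "Switch" from the legacy array to the distributed one...
        comm.Scatterv([legacy, counts, displs, MPI.DOUBLE], local)
        local *= 2.0                                             # parallel work
        # ...and back, so not-yet-converted code keeps running on rank 0.
        comm.Gatherv(local, [legacy, counts, displs, MPI.DOUBLE])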

  6. QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays.

    PubMed

    Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

    2014-01-01

    Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as "smart" Petri dishes, MEAs have demonstrated great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate large amounts of raw data (e.g., 60 channels/MEA, 16-bit A/D conversion, 20 kHz sampling rate: approximately 8 GB per MEA per hour, uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users. PMID:24678297

  7. QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays

    PubMed Central

    Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

    2014-01-01

    Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as “smart” Petri dishes, MEAs have demonstrated great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate large amounts of raw data (e.g., 60 channels/MEA, 16-bit A/D conversion, 20 kHz sampling rate: approximately 8 GB per MEA per hour, uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users. PMID:24678297

  8. TST onboard processor

    NASA Astrophysics Data System (ADS)

    Alaria, G. B.; Ventimiglia, G.; Pennoni, G.

    An onboard processor with time-space-time (TST) stages for telecommunications satellites is described. The overall system characteristics and main functional characteristics are specified, and the frame structure of the system is shown and described. An initial acquisition procedure for synchronizing and initializing the processor is discussed. The functional blocks constituting the processor are described, showing a block diagram and listing data pertinent to system complexity with regard to integrated circuits and dissipated power. The role of the semicustom approach in future payload developments is briefly discussed.

  9. Parallel Architecture For Robotics Computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1990-01-01

    Universal Real-Time Robotic Controller and Simulator (URRCS) is highly parallel computing architecture for control and simulation of robot motion. Result of extensive algorithmic study of different kinematic and dynamic computational problems arising in control and simulation of robot motion. Study led to development of class of efficient parallel algorithms for these problems. Represents algorithmically specialized architecture, in sense capable of exploiting common properties of this class of parallel algorithms. System with both MIMD and SIMD capabilities. Regarded as processor attached to bus of external host processor, as part of bus memory.

  10. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.

  11. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    SciTech Connect

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  12. On Partitioning GridStructured Parallel Computations BHAGIRATH NARAHARI

    E-print Network

    Simha, Rahul

    and taking the maximum among all processors. The central problem we consider is mapping each node in the graph to a processor in a linear array. We focus on consecutive columns to be mapped to a processor; we call this weaker contiguity constraint the part

  13. Approximate programmable quantum processors

    SciTech Connect

    Hillery, Mark; Ziman, Mario; Buzek, Vladimir

    2006-02-15

    A quantum processor is a programmable quantum circuit in which both the data and the program, which specifies the operation that is carried out on the data, are quantum states. We study the situation in which we want to use such a processor to approximate a set of unitary operators to a specified level of precision. We measure how well an operation is performed by the process fidelity between the desired operation and the operation produced by the processor. We show how to find the program for a given processor that produces the best approximation of a particular unitary operation. We also place bounds on the dimension of the program space that is necessary to approximate a set of unitary operators to a specified level of precision.

  14. Scaling and Graphical Transport-Map Analysis of Ambipolar Schottky-Barrier Thin-Film Transistors Based on a Parallel Array of Si Nanowires.

    PubMed

    Jeon, Dae-Young; Pregl, Sebastian; Park, So Jeong; Baraban, Larysa; Cuniberti, Gianaurelio; Mikolajick, Thomas; Weber, Walter M

    2015-07-01

    Si nanowire (Si-NW) based thin-film transistors (TFTs) have been considered a promising candidate for next-generation flexible and wearable electronics as well as high-performance sensor applications. Here, we have fabricated ambipolar Schottky-barrier (SB) TFTs consisting of a parallel array of Si-NWs and performed an in-depth study of their electrical performance and operation mechanism through several electrical parameters extracted from a channel-length-scaling-based method. In particular, the newly suggested current-voltage (I-V) contour map clearly elucidates the unique operation mechanism of the ambipolar SB-TFTs, governed by the Schottky junction between NiSi2 and Si-NW. Further, it reveals, for the first time in SB-based FETs, the important internal electrostatic coupling between the channel and externally applied voltages. This work provides helpful information for the realization of practical circuits with ambipolar SB-TFTs that can be transferred to different substrate technologies and applications. PMID:26087437

  15. NWChem: scalable parallel computational chemistry

    SciTech Connect

    van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

    2011-11-01

    NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem, and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code [1] specifically designed for large scale parallel applications [2]. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. Within the context created by the features above NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.
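
    This is not the Global Array Toolkit's API; the pure-Python toy below only pictures the bookkeeping behind a global address space with one-sided access: any rank names an element by its global index, and the library translates that to an owner and a local offset.

        class GlobalArray1D:
            """Toy global-address-space array. This mimics the index
            translation behind one-sided get/put, without any MPI."""
            def __init__(self, n, nprocs):
                self.n, self.nprocs = n, nprocs
                self.chunk = -(-n // nprocs)          # ceiling division
                self.store = [[0.0] * self.chunk for _ in range(nprocs)]

            def owner(self, g):
                return divmod(g, self.chunk)          # (rank, local offset)

            def get(self, g):
                r, off = self.owner(g)
                return self.store[r][off]             # one-sided read

            def put(self, g, value):
                r, off = self.owner(g)
                self.store[r][off] = value            # one-sided write

        ga = GlobalArray1D(10, nprocs=4)
        ga.put(7, 3.14)
        assert ga.owner(7) == (2, 1) and ga.get(7) == 3.14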

  16. Is Monte Carlo embarrassingly parallel?

    SciTech Connect

    Hoogenboom, J. E.

    2012-07-01

    Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup, and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turns out to be the rendezvous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)
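
    A toy mpi4py rendering of the cycle structure discussed above (the transport physics is reduced to a placeholder survival probability): each cycle ends in a collective rendezvous that assembles the fission source and the k-eff estimate, which is exactly the synchronization point the paper identifies as a bottleneck.

        import random
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        local_source = [random.random() for _ in range(1000)]  # toy fission sites
        for cycle in range(10):
            # Independent particle histories: the embarrassingly parallel part.
            survivors = sum(1 for _ in local_source if random.random() < 0.9)

            # Rendezvous: every cycle ends in a global exchange to form the
            # next fission source and estimate k-eff; this synchronization
            # is what limits speedup as processor counts grow.
            total = comm.allreduce(survivors, op=MPI.SUM)
            k_eff = total / comm.allreduce(len(local_source), op=MPI.SUM)

            target = len(local_source)                 # population control
            local_source = [random.random() for _ in range(target)]
            if rank == 0:
                print(f"cycle {cycle}: k-eff estimate {k_eff:.3f}")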

  17. Parallel design patterns for a low-power, software-defined compressed video encoder

    NASA Astrophysics Data System (ADS)

    Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar

    2011-06-01

    Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allow the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memory-network processor device.

  18. Soft-core processor study for node-based architectures.

    SciTech Connect

    Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

    2008-09-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hardcore processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.

  19. Data Parallel Computing on Graphics Hardware

    E-print Network

    Bejerano, Gill

    Data Parallel Computing on Graphics Hardware. Ian Buck. [Slide-deck fragments, July 27th, 2003: GPUs as streaming processors; why graphics hardware; no read-modify-write textures; multiple "pixel pipes" provide data parallelism; ALU-heavy support.]

  20. Highly parallel computer architecture for robotic computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (inventor); Bejczy, Anta K. (inventor)

    1991-01-01

    In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  1. Parallel Monte Carlo simulation of multilattice thin film growth

    NASA Astrophysics Data System (ADS)

    Shu, J. W.; Lu, Qin; Wong, Wai-on; Huang, Han-chen

    2001-07-01

    This paper describes a new parallel algorithm for the multi-lattice Monte Carlo atomistic simulator for thin film deposition (ADEPT), implemented on a parallel computer using the PVM (Parallel Virtual Machine) message passing library. The parallel algorithm is based on domain decomposition with overlapping and asynchronous communication. Multiple lattices are represented by a single reference lattice through one-to-one mappings, with resulting computational demands being comparable to those in the single-lattice Monte Carlo model. Asynchronous communication and domain overlapping techniques are used to reduce the waiting time and communication time among parallel processors. Results show that the algorithm is highly efficient with a large number of processors. The algorithm was implemented on a parallel machine with 50 processors, and it is suitable for parallel Monte Carlo simulation of thin film growth with either a distributed memory parallel computer or a shared memory machine with message passing libraries. In this work, the significant communication time in parallel Monte Carlo simulation of thin film growth is effectively reduced by adopting domain decomposition with overlapping between sub-domains and asynchronous communication among processors. The overhead of communication does not increase appreciably, and the speedup rises as the number of processors increases. A near-linear increase in computing speed was achieved as the number of processors increased, and there is no theoretical limit on the number of processors to be used. The techniques developed in this work are also suitable for implementing the Monte Carlo code on other parallel systems.
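
    A minimal mpi4py sketch of the overlapping-domain, asynchronous-communication pattern (the deposition physics is replaced by a trivial smoothing stencil, and the one-cell halo width is an assumption): receives and sends for the overlap region are posted first, interior work proceeds while messages are in flight, and the boundary is finished after the wait.

        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        left, right = (rank - 1) % size, (rank + 1) % size

        n = 64
        interior = np.random.rand(n)           # this processor's sub-domain
        halo_l, halo_r = np.empty(1), np.empty(1)

        # Post asynchronous transfers for the overlap region first...
        reqs = [comm.Irecv(halo_l, source=left, tag=1),
                comm.Irecv(halo_r, source=right, tag=0),
                comm.Isend(interior[:1].copy(), dest=left, tag=0),
                comm.Isend(interior[-1:].copy(), dest=right, tag=1)]

        # ...do overlap-free work while the messages are in flight...
        interior[1:-1] = 0.5 * (interior[:-2] + interior[2:])

        MPI.Request.Waitall(reqs)               # halos arrived during the work
        interior[0] = 0.5 * (halo_l[0] + interior[1])
        interior[-1] = 0.5 * (interior[-2] + halo_r[0])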

  2. Graphite: A Distributed Parallel Simulator for Multicores

    E-print Network

    Beckmann, Nathan

    2009-11-09

    This paper introduces the open-source Graphite distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, ...

  3. NAS Parallel Benchmarks Results

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

    1995-01-01

    The NAS Parallel Benchmarks (NPB) were developed in 1991 at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a pencil-and-paper fashion, i.e., the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. In this paper, we present new NPB performance results for the following systems: (a) Parallel-Vector Processors: Cray C90, Cray T90 and Fujitsu VPP500; (b) Highly Parallel Processors: Cray T3D, IBM SP2 and IBM SP-TN2 (Thin Nodes 2); (c) Symmetric Multiprocessing Processors: Convex Exemplar SPP1000, Cray J90, DEC Alpha Server 8400 5/300, and SGI Power Challenge XL. We also present sustained performance per dollar for Class B LU, SP and BT benchmarks, and mention NAS's future plans for the NPB.

  4. Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies

    SciTech Connect

    LEUNG,VITUS J.; ARKIN,ESTHER M.; BENDER,MICHAEL A.; BUNDE,DAVID; JOHNSTON,JEANETTE R.; LAL,ALOK; MITCHELL,JOSEPH S.B.; PHILLIPS,CYNTHIA; SEIDEN,STEVEN S.

    2002-07-01

    The Computational Plant, or Cplant, is a commodity-based supercomputer under development at Sandia National Laboratories. This paper describes resource-allocation strategies to achieve processor locality for parallel jobs in Cplant and other supercomputers. Users of Cplant and other Sandia supercomputers submit parallel jobs to a job queue. When a job is scheduled to run, it is assigned to a set of processors. To obtain maximum throughput, jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This paper introduces new allocation strategies and performance metrics based on space-filling curves and one-dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in the new release of the Cplant System Software, Version 2.0, phased into the Cplant systems at Sandia by May 2002.
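
    One member of the space-filling-curve family is easy to state precisely; the Python sketch below uses a Z-order (Morton) key, which may or may not be the specific curve used on Cplant, together with the simplest possible first-along-the-curve allocation policy.

        def interleave(x, y, bits=16):
            """Morton (Z-order) key: interleave the bits of a processor's
            2-D mesh coordinates so that sorting by key keeps most
            spatial neighbors near each other in the 1-D order."""
            key = 0
            for b in range(bits):
                key |= ((x >> b) & 1) << (2 * b) | ((y >> b) & 1) << (2 * b + 1)
            return key

        def allocate(free_coords, k):
            """Give a job the k free processors that come first along the
            curve (the simplest 1-D policy; real allocators pick the
            tightest run of k free slots)."""
            ordered = sorted(free_coords, key=lambda c: interleave(*c))
            return ordered[:k]

        mesh = [(x, y) for x in range(4) for y in range(4)]
        print(allocate(mesh, 5))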

  5. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
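
    A pure-Python picture of the scattered decomposition (here trial division against replicated base primes stands in for the striding sieve updates an ensemble implementation would perform): each of the p processors keeps every p-th integer, so the work balances regardless of where the primes fall.

        import math

        def base_primes(limit):
            """Serial sieve up to sqrt(n); small enough to replicate on
            every processing element."""
            mark = bytearray([1]) * (limit + 1)
            mark[0:2] = b"\x00\x00"
            for i in range(2, math.isqrt(limit) + 1):
                if mark[i]:
                    mark[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
            return [i for i, m in enumerate(mark) if m]

        def my_primes(n, p, me):
            """Scattered decomposition: processor `me` of `p` keeps the
            integers 2 + me, 2 + me + p, 2 + me + 2p, ... and tests only
            its own strip against the replicated base primes."""
            small = base_primes(math.isqrt(n))
            return [x for x in range(2 + me, n + 1, p)
                    if all(x % q for q in small if q * q <= x)]

        # The union over all "processors" reproduces the serial answer.
        assert sorted(sum((my_primes(50, 4, r) for r in range(4)), [])) == base_primes(50)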

  6. Extended Parallelism Models for Optimization on Massively Parallel Computers

    SciTech Connect

    Eldred, M.S.; Schimel, B.D.

    1999-05-24

    Single-level parallel optimization approaches, those in which either the simulation code executes in parallel or the optimization algorithm invokes multiple simultaneous single-processor analyses, have been investigated previously and been shown to be effective in reducing the time required to compute optimal solutions. However, these approaches have clear performance limitations that prevent effective scaling with the thousands of processors available in massively parallel supercomputers. In more recent work, a capability has been developed for multilevel parallelism in which multiple instances of multiprocessor simulations are coordinated simultaneously. This implementation employs a master-slave approach using the Message Passing Interface (MPI) within the DAKOTA software toolkit. Mathematical analysis on achieving peak efficiency in multilevel parallelism has shown that the most effective processor partitioning scheme is the one that limits the size of multiprocessor simulations in favor of concurrent execution of multiple simulations. That is, if both coarse-grained and fine-grained parallelism can be exploited, then preference should be given to the coarse-grained parallelism. This analysis was verified in multilevel parallel computational experiments on networks of workstations (NOWs) and on the Intel TeraFLOPS massively parallel supercomputer. In current work, methods for exploiting additional coarse-grained parallelism in optimization are being investigated so that fine-grained efficiency losses can be further minimized. These activities are focusing on both algorithmic coarse-grained parallelism (multiple independent function evaluations) through the development of speculative gradient methods and concurrent iterator strategies, and on function evaluation coarse-grained parallelism (multiple separable simulations within a function evaluation) through the development of general partitioning and nested synchronization facilities. The net result is a total of four separate levels of parallelism which can minimize efficiency losses and achieve near linear scaling on massively parallel computers.

  7. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

    Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  8. Fast Parallel Computation Of Multibody Dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Kwan, Gregory L.; Bagherzadeh, Nader

    1996-01-01

    Constraint-force algorithm fast, efficient, parallel-computation algorithm for solving forward dynamics problem of multibody system like robot arm or vehicle. Solves problem in minimum time proportional to log(N) by use of optimal number of processors proportional to N, where N is number of dynamical degrees of freedom: in this sense, constraint-force algorithm both time-optimal and processor-optimal parallel-processing algorithm.
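
    The log(N)-time, N-processor bound is characteristic of parallel-prefix (recursive doubling) computations; the sketch below shows the generic pattern for an associative combine, not the constraint-force algorithm itself:

      def parallel_prefix(values, combine):
          """Hillis-Steele scan: O(log N) rounds with one proc per element."""
          x = list(values)
          n, d = len(x), 1
          while d < n:
              # All updates in one round are independent, so a machine with
              # n processors performs them simultaneously.
              x = [x[i] if i < d else combine(x[i - d], x[i])
                   for i in range(n)]
              d *= 2
          return x

      print(parallel_prefix([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
      # -> [1, 3, 6, 10, 15, 21, 28, 36] after ceil(log2(8)) = 3 rounds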

  9. VLSI Processor For Vector Quantization

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul

    1995-01-01

    Pixel intensities in each kernel compared simultaneously with all code vectors. Prototype high-performance, low-power, very-large-scale integrated (VLSI) circuit designed to perform compression of image data by vector-quantization method. Contains relatively simple analog computational cells operating on direct or buffered outputs of photodetectors grouped into blocks in imaging array, yielding vector-quantization code word for each such block in sequence. Scheme exploits parallel-processing nature of vector-quantization architecture, with consequent increase in speed.

  10. Applications of Parallel Processing to Astrodynamics

    NASA Astrophysics Data System (ADS)

    Coffey, S.; Healy, L.; Neal, H.

    1996-03-01

    Parallel processing is being used to improve the catalog of earth orbiting satellites and for problems associated with the catalog. Initial efforts centered around using SIMD parallel processors to perform debris conjunction analysis and satellite dynamics studies. More recently, the availability of cheap supercomputing processors and parallel processing software such as PVM has enabled the reutilization of existing astrodynamics software in distributed parallel processing environments. Computations once taking many days with traditional mainframes are now being performed in only a few hours. Efforts underway for the US Naval Space Command include conjunction prediction, uncorrelated target processing and a new space object catalog based on orbit determination and prediction with special perturbations methods.

  11. Sesame: A User-Transparent Optimizing Framework for Many-Core Processors

    E-print Network

    Sesame: A User-Transparent Optimizing Framework for Many-Core Processors. Jianbin Fang, Ana Lucia ... a user-transparent optimizing framework for applications running on many-core processors (Sesame). Taking a simple parallelized code provided by the application programmers as input, Sesame chooses and applies the most suitable architecture-specific ...

  12. Parallelization of a treecode

    E-print Network

    R. Valdarnini

    2003-03-18

    I describe here the performance of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture using the MPI message-passing library. For a configuration with a constant number of particles per processor the scalability of the code was tested up to P=128 processors on an IBM SP4 machine. In the large $P$ limit the average CPU time per processor necessary for solving the gravitational interactions is $\sim 10\%$ higher than that expected from the ideal scaling relation. The processor domains are determined every large timestep according to a recursive orthogonal bisection, using a weighting scheme which takes into account the total particle computational load within the timestep. The results of the numerical tests show that the load balancing efficiency $L$ of the code is high ($\geq 90\%$) up to P=32, and decreases to $L \sim 80\%$ when P=128. In the latter case it is found that some aspects of the code performance are affected by machine hardware, while the proposed weighting scheme can achieve a load balance as high as $L \sim 90\%$ even in the large $P$ limit.
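
    A sketch of the weighted recursive orthogonal bisection described above, assuming particles carry explicit work weights and the split axis alternates each level; the data layout and names are illustrative, not the paper's code:

      import random

      def orb(particles, nproc, depth=0):
          """particles: list of (x, y, z, weight); returns one domain per proc."""
          if nproc == 1:
              return [particles]
          axis = depth % 3                        # alternate the split axis
          particles = sorted(particles, key=lambda p: p[axis])
          half = nproc // 2
          target = sum(p[3] for p in particles) * half / nproc
          acc, cut = 0.0, len(particles) - 1
          for i, p in enumerate(particles):       # cut where the work balances
              acc += p[3]
              if acc >= target:
                  cut = i + 1
                  break
          return (orb(particles[:cut], half, depth + 1)
                  + orb(particles[cut:], nproc - half, depth + 1))

      pts = [(random.random(), random.random(), random.random(),
              random.uniform(0.5, 2.0)) for _ in range(4096)]
      print([round(sum(p[3] for p in d)) for d in orb(pts, 8)])  # near-equal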

  13. Fault detection and bypass in a sequence information signal processor

    NASA Technical Reports Server (NTRS)

    Peterson, John C. (Inventor); Chow, Edward T. (Inventor)

    1992-01-01

    The invention comprises a plurality of scan registers, each such register respectively associated with a processor element; an on-chip comparator, encoder and fault bypass register. Each scan register generates a unitary signal the logic state of which depends on the correctness of the input from the previous processor in the systolic array. These unitary signals are input to a common comparator which generates an output indicating whether or not an error has occurred. These unitary signals are also input to an encoder which identifies the location of any fault detected so that an appropriate multiplexer can be switched to bypass the faulty processor element. Input scan data can be readily programmed to fully exercise all of the processor elements so that no fault can remain undetected.

  14. A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC (Superconducting Super Collider) detectors

    SciTech Connect

    Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C. ); Lockyer, N.; VanBerg, R. )

    1989-12-01

    A new era of high-energy physics research is beginning, requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both data rates from the detector and online processing power orders of magnitude higher, well beyond the capabilities of current high-energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of gigabytes per second from the detector and into an array of online processors (i.e., a processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab.

  15. J. Parallel Distrib. Comput. 65 (2005) 374381 www.elsevier.com/locate/jpdc

    E-print Network

    Pan, Yi

    2005-01-01

    (POB) [32], the pipelined reconfigurable mesh (PR-mesh) [6,28], the array with reconfigurable optical buses (AROB) [22,23], the array processors with pipelined buses (APPB) [16], the array processors ... to have addressed the issue of fault tolerance for any of the optically pipelined models is that by ...

  16. Implementing clips on a parallel computer

    NASA Technical Reports Server (NTRS)

    Riley, Gary

    1987-01-01

    The C Language Integrated Production System (CLIPS) is a forward-chaining rule-based language for the training and delivery of expert systems. Conceptually, rule-based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large-grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.
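
    A minimal sketch of the macroscopic rule-splitting approach (whole rules partitioned among workers, each matching its share against the common fact list); the rule and fact encodings are toy stand-ins, not CLIPS structures:

      from multiprocessing import Pool

      FACTS = [("temp", 80), ("pressure", 12), ("valve", "open")]

      RULES = [
          ("high-temp", lambda fs: ("temp", 80) in fs),
          ("low-pressure", lambda fs: any(k == "pressure" and v < 20
                                          for k, v in fs)),
          ("valve-open", lambda fs: ("valve", "open") in fs),
          ("never", lambda fs: ("temp", -1) in fs),
      ]

      def match_subset(rule_indices):
          """One worker's recognize phase: test only its share of the rules."""
          return [RULES[i][0] for i in rule_indices if RULES[i][1](FACTS)]

      if __name__ == "__main__":
          nworkers = 2
          shares = [list(range(w, len(RULES), nworkers))
                    for w in range(nworkers)]
          with Pool(nworkers) as pool:
              hits = [r for part in pool.map(match_subset, shares)
                      for r in part]
          print(hits)   # rules eligible to fire this cycle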

  17. Fast Parallel Computation Of Manipulator Inverse Dynamics

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1991-01-01

    Method for fast parallel computation of inverse dynamics problem, essential for real-time dynamic control and simulation of robot manipulators, undergoing development. Enables exploitation of high degree of parallelism and achievement of significant computational efficiency, while minimizing various communication and synchronization overheads as well as complexity of required computer architecture. Universal real-time robotic controller and simulator (URRCS) consists of internal host processor and several SIMD processors with ring topology. Architecture modular and expandable: more SIMD processors added to match size of problem. Operate asynchronously and in MIMD fashion.

  18. Parallelization of the CI Program PEDICI

    NASA Astrophysics Data System (ADS)

    Thorsteinsson, Thorstein; Rettrup, Sten

    The general CI code PEDICI has been parallelized by decomposing the occurring summation over two-electron integrals. The parallelization was formulated in terms of a "master/slave" model, and realized through use of the "PVM" message-passing facility. We have aimed at achieving a reasonably simple implementation for use on machines with intermediate numbers of processors. Exploratory test runs on an IBM SP supercomputer (consisting of RS/6000 model P2SC (120 MHz) nodes) show a very satisfactory performance increase with the number of processors used, as well as encouraging balancing of the workload. Our largest 32-processor test case gives a speed-up factor of 30.27.

  19. Efficacy of Code Optimization on Cache-based Processors

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system. It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.
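
    A small NumPy demonstration of the unit-stride point, assuming a C-ordered (row-major) array; absolute timings are machine-dependent, but the contiguous row traversal is typically the faster one:

      import time
      import numpy as np

      a = np.ones((4096, 4096))     # C order: rows are contiguous in memory

      t0 = time.perf_counter()
      s_rows = sum(a[i, :].sum() for i in range(a.shape[0]))   # unit stride
      t1 = time.perf_counter()
      s_cols = sum(a[:, j].sum() for j in range(a.shape[1]))   # stride 4096
      t2 = time.perf_counter()

      print(f"row-wise {t1 - t0:.3f}s, column-wise {t2 - t1:.3f}s")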

  20. Parallel machine architecture for production rule systems

    DOEpatents

    Allen, Jr., John D. (Knoxville, TN); Butler, Philip L. (Knoxville, TN)

    1989-01-01

    A parallel processing system for production rule programs utilizes a host processor for storing production rule right-hand sides (RHS) and a plurality of rule processors for storing left-hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system's Recognize-Act cycle to match their respective LHSs against a stored list of working memory elements (WMEs) in order to find a self-consistent set of WMEs. The list of WMEs is dynamically varied during the Act phase of the system, in which the host executes or fires rule RHSs for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets, at which time the production rule system is halted.

  1. Electrostatically focused addressable field emission array chips (AFEA's) for high-speed massively parallel maskless digital E-beam direct write lithography and scanning electron microscopy

    DOEpatents

    Thomas, Clarence E. (Knoxville, TN); Baylor, Larry R. (Farragut, TN); Voelkl, Edgar (Oak Ridge, TN); Simpson, Michael L. (Knoxville, TN); Paulus, Michael J. (Knoxville, TN); Lowndes, Douglas H. (Knoxville, TN); Whealton, John H. (Oak Ridge, TN); Whitson, John C. (Clinton, TN); Wilgen, John B. (Oak Ridge, TN)

    2002-12-24

    Systems and methods are described for addressable field emission array (AFEA) chips. A method of operating an addressable field-emission array includes: generating a plurality of electron beams from a plurality of emitters that compose the addressable field-emission array; and focusing at least one of the plurality of electron beams with an on-chip electrostatic focusing stack. The systems and methods provide advantages including the avoidance of space-charge blow-up.

  2. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

  3. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

  4. FY 2006 Accomplishment Colony - "Services and Interfaces to Support Large Numbers of Processors"

    SciTech Connect

    Jones, T; Kale, L; Moreira, J; Mendes, C; Chakravorty, S; Tauferner, A; Inglett, T

    2006-06-30

    The Colony Project is developing operating system and runtime system technology to enable efficient general purpose environments on tens of thousands of processors. To accomplish this, we are investigating memory management techniques, fault management strategies, and parallel resource management schemes. Recent results show promising findings for scalable strategies based on processor virtualization, in-memory checkpointing, and parallel aware modifications to full featured operating systems.

  5. Complementing user-level coarse-grain parallelism with implicit speculative parallelism 

    E-print Network

    Ioannou, Nikolas

    2012-11-29

    Multi-core and many-core systems are the norm in contemporary processor technology and are expected to remain so for the foreseeable future. Parallel programming is, thus, here to stay and programmers have to endorse it ...

  6. On one-way cellular arrays

    SciTech Connect

    Ibarra, O.H.; Jiang, T.

    1987-12-01

    There are two simple models of a parallel language recognizer: one-way cellular array (OCA) and one-way iterative array (OIA). For inputs of length n, both arrays consist of n identical finite-state machines (cells). The communication between cells is one way, from left to right. The difference in the two models is in the manner in which the input is applied. For the OCA, the input is applied to the cells in parallel. For the OIA, the input is applied serially to the leftmost processor. An input string is accepted if the rightmost cell eventually enters an accepting state. The authors show that OCA's accept exactly the same class of languages as OIA's. It is relatively easy to show that OIA's can simulate OCA's. The difficult part is the converse, i.e., that OCA's can simulate OIA's. This is rather surprising, since in an OIA, every cell of the array has access to each symbol of the input string, whereas in an OCA, the ith cell can only access the first i symbols of the input. This result, when combined with known results concerning OIA's, answers some open questions concerning the computational complexity of OCA's. They also prove some new results concerning linear-time OCA's and OIA's. For example, they show: (1) linear-time OCA's are equivalent to 2n-time OIA's (note that 2n-time is optimal for OIA's); (2) the concatenation of a linear-time OCA language with a real-time (i.e., n-time) OCA language is a linear-time OCA language; (3) every bounded language accepted by a one-way multihead nondeterministic pushdown automaton is a linear-time OCA language.

  7. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    NASA Astrophysics Data System (ADS)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low-latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color subsampling patterns like YUV 4:2:2 or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub-blocks, and pixel rows are examined in this work. The deblocking architecture consists of a basic cell called the deblocking filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs; the DFM serves the data required for the different number of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.
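
    The macroblock-level technique mentioned above can be sketched as a wavefront: each macroblock depends on its left and top neighbors, so all macroblocks on an anti-diagonal are mutually independent and could each be assigned to a DFU-like core. This is a generic illustration, not the paper's architecture; filter_mb is a hypothetical placeholder for the real edge filtering:

      def wavefront_schedule(mb_cols, mb_rows):
          """Yield lists of (x, y) macroblocks that can be filtered together."""
          for d in range(mb_cols + mb_rows - 1):
              yield [(x, d - x)
                     for x in range(max(0, d - mb_rows + 1),
                                    min(d, mb_cols - 1) + 1)]

      def filter_mb(xy):
          pass   # stand-in for luma/chroma edge filtering of one macroblock

      for wave in wavefront_schedule(mb_cols=120, mb_rows=68):  # 1080p grid
          for xy in wave:          # each iteration is independent work
              filter_mb(xy)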

  8. Stereoscopic Optical Signal Processor

    NASA Technical Reports Server (NTRS)

    Graig, Glenn D.

    1988-01-01

    Optical signal processor produces two-dimensional cross correlation of images from stereoscopic video camera in real time. Cross correlation used to identify object, determine distance, or measure movement. Left and right cameras modulate beams from light source for correlation in video detector. Switch in position 1 produces information about range of object viewed by cameras. Position 2 gives information about movement. Position 3 helps to identify object.

  9. Distributed processor allocation for launching applications in a massively connected processors complex

    DOEpatents

    Pedretti, Kevin (Goleta, CA)

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  10. Reconfigurable Array Interconnection by Photorefractive Volume Holography

    NASA Astrophysics Data System (ADS)

    Ford, Joseph Earl

    Parallel computing algorithms can be effectively implemented by combining local electronic processing with global optical interconnection. This dissertation describes the development of an optical array interconnection network based on photorefractive volume holography. The approach taken uses the correlation matrix-tensor multiplier (CMTM) algorithm, which optically convolves the phase-encoded input array with a control image holding the interconnection weights. Two-dimensional arrays can be interconnected with complex grey-level weights using binary phase-only spatial light modulators. The CMTM algorithm allows graceful accommodation of limited modulator size by trading off control image bandwidth for output signal to noise ratio. The optical correlation was performed by photorefractive four-wave mixing, storing the interconnection information in a single exposure of the control image. Multiple interconnection patterns were prestored as color-multiplexed volume reflection holograms in z-cut LiNbO_3. Fast reconfiguration between interconnection patterns is possible using a wavelength tunable source, decoupling both the modulation and switching speeds from the slow photorefractive response. Experimental results confirmed theoretical predictions that the algorithm works best for densely connected networks, with a large fan-in to each output. Interconnection of up to 4096 inputs and outputs was demonstrated using such dense interconnection patterns. An aggregate average SNR of over 200 was obtained for 1024 inputs and outputs. Finally, a compact packaged optoelectronic processor system using CMTM interconnection is proposed, and its scaling behavior investigated.

  11. Radiofrequency detector coil performance maps for parallel MRI applications

    E-print Network

    Lattanzi, Riccardo

    2006-01-01

    Parallel MRI techniques allow acceleration of MR imaging beyond traditional speed limits. In parallel MRI, arrays of radiofrequency (RF) detector coils are used to perform some degree of spatial encoding which ...

  12. Software-Reconfigurable Processors for Spacecraft

    NASA Technical Reports Server (NTRS)

    Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

    2005-01-01

    A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, and error-correction coding and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

  13. Parallel algorithms for finding trigonometric sums

    SciTech Connect

    Stpiczynski, P.; Paprzycki, M.

    1995-12-01

    Parallel versions of Goertzel and Reinsch algorithms for finding trigonometric sums are introduced as a special case of efficient parallel algorithms for solving linear recurrence systems. The results of the experiments performed on a 20-processor Sequent Symmetry are presented and discussed.
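
    For reference, the sequential Goertzel recurrence that such work parallelizes (via its reduction to a linear recurrence system, which is not reproduced here) computes C = sum a_k cos(k*theta) and S = sum a_k sin(k*theta); this sketch is the textbook serial form, not the paper's parallel algorithm:

      from math import cos, sin, pi

      def goertzel(a, theta):
          """Goertzel evaluation of sum(a[k]*cos(k*theta)) and the sin sum."""
          c = 2.0 * cos(theta)
          u1 = u2 = 0.0
          for ak in reversed(a):          # k = n, n-1, ..., 0
              u1, u2 = ak + c * u1 - u2, u1
          # After the loop, u1 and u2 hold u_0 and u_1 of the recurrence.
          return u1 - u2 * cos(theta), u2 * sin(theta)

      a = [1.0, 0.5, 0.25, 0.125]
      C, S = goertzel(a, pi / 5)
      check = sum(ak * cos(k * pi / 5) for k, ak in enumerate(a))
      print(abs(C - check) < 1e-12)       # True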

  14. Highly scalable linear solvers on thousands of processors.

    SciTech Connect

    Domino, Stefan Paul; Karlin, Ian; Siefert, Christopher; Hu, Jonathan Joseph; Robinson, Allen Conrad; Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  15. SIMD-parallel understanding of natural language with application to magnitude-only optical parsing of text

    NASA Astrophysics Data System (ADS)

    Schmalz, Mark S.

    1992-08-01

    A novel parallel model of natural language (NL) understanding is presented which can realize high levels of semantic abstraction, and is designed for implementation on synchronous SIMD architectures and optical processors. Theory is expressed in terms of the Image Algebra (IA), a rigorous, concise, inherently parallel notation which unifies the design, analysis, and implementation of image processing algorithms. The IA has been implemented on numerous parallel architectures, and IA preprocessors and interpreters are available for the FORTRAN and Ada languages. In a previous study, we demonstrated the utility of IA for mapping MEA-conformable (Multiple Execution Array) algorithms to optical architectures. In this study, we extend our previous theory to map serial parsing algorithms to the synchronous SIMD paradigm. We initially derive a two-dimensional image that is based upon the adjacency matrix of a semantic graph. Via IA template mappings, the operations of bottom-up parsing, semantic disambiguation, and referential resolution are implemented as image-processing operations upon the adjacency matrix. Pixel-level operations are constrained to Hadamard addition and multiplication, thresholding, and row/column summation, which are available in magnitude-only optics. Assuming high parallelism in the parse rule base, the parsing of n input symbols with a grammar consisting of M rules of arity H, on an N-processor architecture, could exhibit a computational cost that is constant and of order H. Since H << n is typical, we claim a fundamental complexity advantage over the current O(n) theoretical time limit of MIMD parsing architectures. Additionally, we show that inference over a semantic net is achievable in parallel in O(m) time, where m corresponds to the depth of the search tree. Results are evaluated in terms of computational cost on SISD and SIMD processors, with discussion of implementation on electro-optic architectures.

  16. Silicon Auditory Processors as Computer Peripherals

    E-print Network

    Lazzaro, John

    Silicon Auditory Processors as Computer Peripherals. John Lazzaro, John Wawrzynek, CS Division, UC ... We describe an alternative output method for silicon auditory models, suitable for direct interface to digital ...

  17. Parallel 3-D method of characteristics in MPACT

    SciTech Connect

    Kochunas, B.; Downar, T. J.; Liu, Z.

    2013-07-01

    A new parallel 3-D MOC kernel has been developed and implemented in MPACT which makes use of the modular ray tracing technique to reduce computational requirements and to facilitate parallel decomposition. The parallel model makes use of both distributed and shared memory parallelism which are implemented with the MPI and OpenMP standards, respectively. The kernel is capable of parallel decomposition of problems in space, angle, and by characteristic rays up to O(10^4) processors. Initial verification of the parallel 3-D MOC kernel was performed using the Takeda 3-D transport benchmark problems. The eigenvalues computed by MPACT are within the statistical uncertainty of the benchmark reference and agree well with the averages of other participants. The MPACT k-eff differs from the benchmark results for rodded and un-rodded cases by 11 and -40 pcm, respectively. The calculations were performed for various numbers of processors and parallel decompositions up to 15625 processors, all producing the same result at convergence. The parallel efficiency of the worst case was 60%, while very good efficiency (>95%) was observed for cases using 500 processors. The overall run time for the 500 processor case was 231 seconds and 19 seconds for the case with 15625 processors. Ongoing work is focused on developing theoretical performance models and the implementation of acceleration techniques to minimize the number of iterations to converge. (authors)

  18. Reconfigurable data path processor

    NASA Technical Reports Server (NTRS)

    Donohoe, Gregory (Inventor)

    2005-01-01

    A reconfigurable data path processor comprises a plurality of independent processing elements, each advantageously comprising an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of through-putting an input as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic bits is generated according to the processing of the first processable value. A second set of arithmetic bits may be generated from a second processing operation. The selection of the arithmetic status bits is performed by an arithmetic-status-bit multiplexer, which selects the desired set of arithmetic status bits from among the first and second sets of arithmetic status bits. The conditional multiplexer evaluates the selected arithmetic status bits according to a logical mask defining an algorithm for evaluating the arithmetic status bits.

  19. Parallel Recording of Neurotransmitters Release from Chromaffin Cells Using a 10 × 10 CMOS IC Potentiostat Array with On-Chip Working Electrodes

    PubMed Central

    Kim, Brian Namghi; Herbst, Adam D.; Kim, Sung June; Minch, Bradley A.; Lindau, Manfred

    2012-01-01

    Neurotransmitter release is modulated by many drugs and molecular manipulations. We present an active CMOS-based electrochemical biosensor array with high throughput capability (100 electrodes) for on-chip amperometric measurement of neurotransmitter release. The high throughput of the biosensor array will accelerate the data collection needed to determine the statistical significance of changes produced under varying conditions, reducing the required time from several weeks to a few hours. The biosensor is designed and fabricated using a combination of CMOS integrated circuit (IC) technology and a photolithography process to incorporate platinum working electrodes on-chip. We demonstrate the operation of an electrode array with integrated high-gain potentiostats and output time-division multiplexing with minimum dead time for readout. The on-chip working electrodes are patterned by conformal deposition of Pt and lift-off photolithography. The conformal deposition method protects the underlying electronic circuits from contact with the electrolyte that covers the electrode array during measurement. The biosensor was validated by simultaneous measurement of amperometric currents from 100 electrodes in response to dopamine injection, which revealed the time course of dopamine diffusion along the surface of the biosensor array. The biosensor simultaneously recorded neurotransmitter release successfully from multiple individual living chromaffin cells. The biosensor was capable of resolving small and fast amperometric spikes reporting release from individual vesicle secretions. We anticipate that this device will accelerate the characterization of the modulation of neurotransmitter secretion from neuronal and endocrine cells by pharmacological and molecular manipulations of the cells. PMID:23084756

  20. A Low Power Front-End for Embedded Processors Using a Block-Aware Instruction Set

    E-print Network

    Kozyrakis, Christos

    ... fetching happens on nearly every cycle and involves accesses to large memory arrays such as instruction ... with small front-end arrays (2-KByte, 2-way I-cache, 16-entry BTB). The processor core is similar to Intel ... better results. ... cost to manufacture, package, and cool the chip. Energy consumption determines ...

  1. A generic fine-grained parallel C

    NASA Technical Reports Server (NTRS)

    Hamet, L.; Dorband, John E.

    1988-01-01

    With the present availability of parallel processors of vastly different architectures, there is a need for a common language interface to multiple types of machines. The parallel C compiler, currently under development, is intended to be such a language. This language is based on the belief that an algorithm designed around fine-grained parallelism can be mapped relatively easily to different parallel architectures, since a large percentage of the parallelism has been identified. The compiler generates a FORTH-like machine-independent intermediate code. A machine-dependent translator will reside on each machine to generate the appropriate executable code, taking advantage of the particular architectures. The goal of this project is to allow a user to run the same program on such machines as the Massively Parallel Processor, the CRAY, the Connection Machine, and the CYBER 205 as well as serial machines such as VAXes, Macintoshes and Sun workstations.

  2. Parallel automated adaptive procedures for unstructured meshes

    NASA Technical Reports Server (NTRS)

    Shephard, M. S.; Flaherty, J. E.; Decougny, H. L.; Ozturan, C.; Bottasso, C. L.; Beall, M. W.

    1995-01-01

    Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers. The key areas of new development are focused on the support of effective parallel computations when the structure of the numerical discretization, the mesh, is evolving, and in fact constructed, during the computation. All the procedures presented operate in parallel on already distributed mesh information. Starting from a mesh definition in terms of a topological hierarchy, techniques to support the distribution, redistribution and communication of the mesh entities over the processors are given, along with algorithms to dynamically balance processor workload based on the migration of mesh entities. A procedure to automatically generate meshes in parallel, starting from CAD geometric models, is given. Parallel procedures to enrich the mesh through local mesh modifications are also given. Finally, the combination of these techniques to produce a parallel automated finite element analysis procedure for rotorcraft aerodynamics calculations is discussed and demonstrated.

  3. Parallel processing data network of master and slave transputers controlled by a serial control network

    SciTech Connect

    Crosetto, Dario B.

    1996-01-01

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

  4. The 2nd Symposium on the Frontiers of Massively Parallel Computations

    NASA Technical Reports Server (NTRS)

    Mills, Ronnie (editor)

    1988-01-01

    Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

  5. Compiler Transformations to Generate Reentrant C Programs to Assist Software Parallelization

    E-print Network

    Smith, Adam

    2009-06-16

    ... programs is considered a non-trivial task; writing parallel applications to take advantage of the advances in the number of cores in a processor severely complicates the process. Writing parallel applications requires programs and functions ...

  6. Never Trust Your Word Processor

    ERIC Educational Resources Information Center

    Linke, Dirk

    2009-01-01

    In this article, the author talks about the auto correction mode of word processors that leads to a number of problems and describes an example in biochemistry exams that shows how word processors can lead to mistakes in databases and in papers. The author contends that, where this system is applied, spell checking should not be left to a word…

  7. Compact hohlraum configuration with parallel planar-wire-array x-ray sources at the 1.7-MA Zebra generator.

    PubMed

    Kantsyrev, V L; Chuvatin, A S; Rudakov, L I; Velikovich, A L; Shrestha, I K; Esaulov, A A; Safronova, A S; Shlyaptseva, V V; Osborne, G C; Astanovitsky, A L; Weller, M E; Stafford, A; Schultz, K A; Cooper, M C; Cuneo, M E; Jones, B; Vesey, R A

    2014-12-01

    A compact Z-pinch x-ray hohlraum design with parallel-driven x-ray sources is experimentally demonstrated in a configuration with a central target and tailored shine shields at a 1.7-MA Zebra generator. Driving in parallel two magnetically decoupled compact double-planar-wire Z pinches has demonstrated the generation of synchronized x-ray bursts that correlated well in time with x-ray emission from a central reemission target. Good agreement between simulated and measured hohlraum radiation temperature of the central target is shown. The advantages of compact hohlraum design applications for multi-MA facilities are discussed. PMID:25615200

  8. Compact hohlraum configuration with parallel planar-wire-array x-ray sources at the 1.7-MA Zebra generator

    NASA Astrophysics Data System (ADS)

    Kantsyrev, V. L.; Chuvatin, A. S.; Rudakov, L. I.; Velikovich, A. L.; Shrestha, I. K.; Esaulov, A. A.; Safronova, A. S.; Shlyaptseva, V. V.; Osborne, G. C.; Astanovitsky, A. L.; Weller, M. E.; Stafford, A.; Schultz, K. A.; Cooper, M. C.; Cuneo, M. E.; Jones, B.; Vesey, R. A.

    2014-12-01

    A compact Z-pinch x-ray hohlraum design with parallel-driven x-ray sources is experimentally demonstrated in a configuration with a central target and tailored shine shields at a 1.7-MA Zebra generator. Driving in parallel two magnetically decoupled compact double-planar-wire Z pinches has demonstrated the generation of synchronized x-ray bursts that correlated well in time with x-ray emission from a central reemission target. Good agreement between simulated and measured hohlraum radiation temperature of the central target is shown. The advantages of compact hohlraum design applications for multi-MA facilities are discussed.

  9. Performance characteristics of a parallel treecode

    E-print Network

    R. Valdarnini

    2002-12-11

    I describe here the performance of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture using the MPI message passing library. For a configuration with a constant number of particles per processor the scalability of the code has been tested up to P=32 processors. The average CPU time per processor necessary for solving the gravitational interactions is within $\sim 10\%$ of that expected from the ideal scaling relation. The load balancing efficiency is high ($\gtrsim 90\%$) if the processor domains are determined every large timestep according to a weighting scheme which takes into account the total particle computational load within the timestep.

  10. Parallel implementation of an algorithm for Delaunay triangulation

    NASA Technical Reports Server (NTRS)

    Merriam, Marshall L.

    1992-01-01

    This work concerns the theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128-processor MIMD computer. Tanemura's algorithm does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a conventional, vector processing, supercomputer is problematic. Efficient implementation on a parallel architecture is possible, however. In this work, speeds in excess of 8 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

  11. Parallel implementation of an algorithm for Delaunay triangulation

    NASA Technical Reports Server (NTRS)

    Merriam, Marshal L.

    1992-01-01

    The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128-processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

  12. Low-power, parallel photonic interconnections for Multi-Chip Module applications

    SciTech Connect

    Carson, R.F.; Lovejoy, M.L.; Lear, K.L.

    1994-12-31

    New applications of photonic interconnects will involve the insertion of parallel-channel links into Multi-Chip Modules (MCMs). Such applications will drive photonic link components into more compact forms that consume far less power than traditional telecommunication data links. MCM-based applications will also require simplified drive circuitry, lower cost, and higher reliability than has been demonstrated currently in photonic and optoelectronic technologies. The work described is a parallel link array, designed for vertical (Z-Axis) interconnection of the layers in a MCM-based signal processor stack, operating at a data rate of 100 Mb/s. This interconnect is based upon high-efficiency VCSELs, HBT photoreceivers, integrated micro-optics, and MCM-compatible packaging techniques.

  13. Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput

    E-print Network

    Sgall, Jiri

    Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput. Ondrej Zajicek, Jiri Sgall. ... of scheduling unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between its release time and deadline on a subcube of processors. The objective is to maximize the number of early jobs. We provide ...

  14. MIT Lincoln Laboratory Parallel Vector Tile-Optimized Library

    E-print Network

    Kepner, Jeremy

    MIT Lincoln Laboratory Parallel Vector Tile-Optimized Library (PVTOL) ... for applications with large IO and processing requirements. Approach: develop a Parallel Vector Tile Optimizing Library (PVTOL) ... of tiled processors. Novel storage should provide 10x more IO. [Slide figure: FFT pipeline (A, B, C) mapped across processors P0-P2 by an Automated Parallel Mapper.]

  15. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  16. Algorithms for Automatic Alignment of Arrays

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Oliker, Leonid; Schreiber, Robert; Sheffler, Thomas J.

    1996-01-01

    Aggregate data objects (such as arrays) are distributed across the processor memories when compiling a data-parallel language for a distributed-memory machine. The mapping determines the amount of communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: an alignment that maps all the objects to an abstract template, followed by a distribution that maps the template to the processors. This paper describes algorithms for solving the various facets of the alignment problem: axis and stride alignment, static and mobile offset alignment, and replication labeling. We show that optimal axis and stride alignment is NP-complete for general program graphs, and give a heuristic method that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. We also show how local graph contractions can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. We show how to model the static offset alignment problem using linear programming, and we show that loop-dependent mobile offset alignment is sometimes necessary for optimum performance. We describe an algorithm for determining mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself or can be used to improve performance. We describe an algorithm based on network flow that replicates objects so as to minimize the total amount of broadcast communication in replication.

  17. Global Arrays

    Energy Science and Technology Software Center (ESTSC)

    2006-02-23

    The Global Arrays (GA) toolkit provides an efficient and portable "shared-memory" programming interface for distributed-memory computers. Each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed dense multi-dimensional arrays, without need for explicit cooperation by other processes. Unlike other shared-memory environments, the GA model exposes to the programmer the non-uniform memory access (NUMA) characteristics of the high performance computers and acknowledges that access to a remote portion of the shared data is slower than to the local portion. The locality information for the shared data is available, and a direct access to the local portions of shared data is provided. Global Arrays have been designed to complement rather than substitute for the message-passing programming model. The programmer is free to use both the shared-memory and message-passing paradigms in the same program, and to take advantage of existing message-passing software libraries. Global Arrays are compatible with the Message Passing Interface (MPI).
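
    A toy model of the programming model (not the real GA API): a block-distributed 1-D array where any process can get or put an arbitrary logical slice without explicit cooperation from the owners. Here dict entries stand in for per-process memories; in the real toolkit one-sided communication does this job:

      class ToyGlobalArray:
          def __init__(self, n, nproc):
              self.block = (n + nproc - 1) // nproc
              self.local = {p: [0.0] * self.block for p in range(nproc)}

          def owner(self, i):
              return i // self.block, i % self.block  # (process, local offset)

          def put(self, lo, hi, values):              # write slice [lo, hi)
              for i, v in zip(range(lo, hi), values):
                  p, off = self.owner(i)
                  self.local[p][off] = v

          def get(self, lo, hi):                      # read slice [lo, hi)
              out = []
              for i in range(lo, hi):
                  p, off = self.owner(i)
                  out.append(self.local[p][off])
              return out

      ga = ToyGlobalArray(n=100, nproc=4)
      ga.put(20, 30, [float(x) for x in range(10)])   # spans two "processes"
      print(ga.get(24, 27))                           # -> [4.0, 5.0, 6.0]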

  18. Scalable load balancing for massively parallel distributed Monte Carlo particle transport

    SciTech Connect

    O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

    2013-07-01

    In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log N), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
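
    A sketch of the iterated pair-wise idea under a hypercube-style pairing, assuming N is a power of two; work is reduced to a single number per processor for simplicity, and this is an illustration of the O(log N) structure, not the paper's algorithm:

      def pairwise_balance(work):
          """work[i] = workload of processor i; len(work) a power of two."""
          n, r = len(work), 1
          while r < n:                  # log2(n) rounds
              for i in range(n):
                  j = i ^ r             # partner: rank differs in one bit
                  if i < j:             # balance each pair once per round
                      mean = (work[i] + work[j]) / 2.0
                      work[i] = work[j] = mean
              r <<= 1
          return work                   # every entry is the global mean

      print(pairwise_balance([100.0, 0.0, 40.0, 20.0]))   # -> [40.0] * 4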

  19. Asynchronous parallel status comparator

    DOEpatents

    Arnold, Jeffrey W. (828 Hickory Ridge Rd., Aiken, SC 29801); Hart, Mark M. (223 Limerick Dr., Aiken, SC 29803)

    1992-01-01

    Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.

  20. Asynchronous parallel status comparator

    DOEpatents

    Arnold, J.W.; Hart, M.M.

    1992-12-15

    Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals correspond to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.
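
    In software terms, the comparator stage reduces to tallying reported locations and firing when any location reaches the user-specified count; a minimal sketch (illustrative only, not from the patent):

```c
#include <stdio.h>

#define NLOCATIONS 64

/* Return the first location reported by at least `threshold` receivers,
 * or -1 if no sufficient match exists. */
int match_location(const int *reports, int nreports, int threshold) {
    int count[NLOCATIONS] = {0};
    for (int i = 0; i < nreports; i++)
        if (++count[reports[i]] >= threshold)
            return reports[i];
    return -1;
}

int main(void) {
    int reports[] = {12, 40, 12};             /* data sets from three receivers */
    int loc = match_location(reports, 3, 2);  /* user specified "two or more"   */
    if (loc >= 0)
        printf("restore location %d to its nominal condition\n", loc);
    return 0;
}
```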

  1. Benchmarking NWP Kernels on Multi- and Many-core Processors

    NASA Astrophysics Data System (ADS)

    Michalakes, J.; Vachharajani, M.

    2008-12-01

    Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.

  2. Roles of Parallelizing Compilers for Low Power Manycores

    E-print Network

    Kasahara, Hironori

    Built at Univ. of Delaware in Mar. 2011 for low-power manycore hardware, software, and applications. Roles of Parallelizing Compilers for Low Power Manycores. Hironori Kasahara, Professor, Department of Computer Science & Engineering; Director, Advanced Multicore Processor

  3. View-Oriented Parallel Programming and View-based Consistency

    E-print Network

    Werstein, Paul

    generated and executed at one processor be propagated to and executed in that order at other processors, let alone to see them in order. Many parallel applications regulate their accesses to shared data by synchronization, so not all valid interleavings of their memory accesses are relevant to their real executions

  4. A Coordination Layer for Exploiting Task Parallelism with HPF

    E-print Network

    Orlando, Salvatore

    has recently received much attention [6, 5]. Depending on the application, HPF tasks can be organized to exploit forms of task parallelism like pipelines and processor farms [11, 7]. Replication entails using a processor farm structure [7], where incoming jobs are dispatched on one of the replicated workers. We present templates which

  5. Bipartite memory network architectures for parallel processing

    SciTech Connect

    Smith, W.; Kale, L.V. . Dept. of Computer Science)

    1990-01-01

    Parallel architectures are broadly classified as either shared memory or distributed memory architectures. In this paper, the authors propose a third family of architectures, called bipartite memory network architectures. In this architecture, processors and memory modules constitute a bipartite graph, where each processor is allowed to access a small subset of the memory modules, and each memory module allows access from a small set of processors. The architecture is particularly suitable for computations requiring dynamic load balancing. The authors explore the properties of this architecture by examining the Perfect Difference set based topology for the graph. Extensions of this topology are also suggested.

  6. Parallel methods for the flight simulation model

    SciTech Connect

    Xiong, Wei Zhong; Swietlik, C.

    1994-06-01

    The Advanced Computer Applications Center (ACAC) has been involved in evaluating advanced parallel architecture computers and the applicability of these machines to computer simulation models. The advanced systems investigated include parallel machines with shared-memory and distributed architectures, consisting of an eight processor Alliant FX/8, a twenty-four processor Sequent Symmetry, a Cray XMP, an IBM RISC 6000 model 550, and the Intel Touchstone eight processor Gamma and 512 processor Delta machines. Since parallelizing a truly efficient application program for a parallel machine is a difficult task, implementation for these machines in a realistic setting has been largely overlooked. The ACAC has developed considerable expertise in optimizing and parallelizing application models on a collection of advanced multiprocessor systems. One such application model is the Flight Simulation Model, which uses a set of differential equations to describe the flight characteristics of a launched missile by means of a trajectory. The Flight Simulation Model was written in the FORTRAN language with approximately 29,000 lines of source code. Depending on the number of trajectories, the computation can require several hours to a full day of CPU time on a DEC/VAX 8650 system. There is an impetus to reduce the execution time and utilize the advanced parallel architecture computing environment available. ACAC researchers developed a parallel method that allows the Flight Simulation Model to run in parallel on the multiprocessor system. For the benchmark data tested, the parallel Flight Simulation Model implemented on the Alliant FX/8 has achieved nearly linear speedup. In this paper, we describe a parallel method for the Flight Simulation Model. We believe the method presented in this paper provides a general concept for the design of parallel applications. This concept, in most cases, can be adapted to many other sequential application programs.
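
    Because the trajectories are mutually independent, the parallel method amounts to distributing them across processors. A minimal sketch of that structure follows (expressed with MPI for concreteness; the actual model is roughly 29,000 lines of Fortran, and the integrator below is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for integrating one missile trajectory. */
static double fly_trajectory(int id) {
    double state = (double)(id + 1);
    for (int step = 0; step < 100000; step++)
        state += 1e-6 * state;                /* stand-in integration step */
    return state;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, ntraj = 1024;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, total = 0.0;
    for (int t = rank; t < ntraj; t += size)  /* cyclic trajectory assignment */
        local += fly_trajectory(t);

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("checksum over %d trajectories: %g\n", ntraj, total);
    MPI_Finalize();
    return 0;
}
```

    With no data dependence between trajectories, speedup is limited mainly by load imbalance, which is consistent with the nearly linear speedup reported on the Alliant FX/8.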

  7. PVM Enhancement for Beowulf Multiple-Processor Nodes

    NASA Technical Reports Server (NTRS)

    Springer, Paul

    2006-01-01

    A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.
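
    A sketch of how such a spawn might look from a user program (the task name and host name are hypothetical; pvm_spawn and PvmTaskHost are standard PVM 3 facilities):

```c
#include <stdio.h>
#include "pvm3.h"

int main(void) {
    int tids[4];
    /* Ask PVM to place four copies of "worker" on one Beowulf node; on a
     * multiprocessor node, the multiple tasks exercise the multiple CPUs. */
    int n = pvm_spawn("worker", NULL, PvmTaskHost, "beowulf-node1", 4, tids);
    printf("spawned %d of 4 tasks on beowulf-node1\n", n);
    pvm_exit();
    return 0;
}
```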

  8. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  9. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Sharma, Anuj; Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  10. Single-Point Access to Data Distributed on Many Processors

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    The functions and data structures necessary to implement the Chapel concepts of distributions, domains, allocation, and access are defined, along with the compiler interfaces for transforming Chapel source into run-time implementations of these concepts. A complete set of object-oriented operators is defined that enables one to access elements of a distributed array through regular arithmetic index sets, giving the programmer the illusion that all the elements are collocated on a single processor. This means that arbitrary regions of the arrays can be fragmented and distributed across multiple processors with a single point of access. This is important because it can significantly improve programmer productivity by allowing programmers to concentrate on the high-level details of the algorithm without worrying about the efficiency and communication details of the underlying representation.
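
    The essential trick is an accessor that hides the owner computation; a minimal sketch for a block-distributed 1-D array (illustrative only; Chapel's actual distributions are far more general):

```c
#include <stdio.h>

typedef struct { int nelems; int nprocs; } Dist;

static int block_size(Dist d)       { return (d.nelems + d.nprocs - 1) / d.nprocs; }
static int owner(Dist d, int i)     { return i / block_size(d); }
static int local_idx(Dist d, int i) { return i % block_size(d); }

int main(void) {
    Dist d = { .nelems = 100, .nprocs = 4 };
    /* A regular arithmetic index set: 20, 25, ..., 45. The caller never
     * sees which processor owns each element. */
    for (int i = 20; i <= 45; i += 5)
        printf("A[%d] -> processor %d, local offset %d\n",
               i, owner(d, i), local_idx(d, i));
    return 0;
}
```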

  11. Parallel Pascal - An extended Pascal for parallel computers

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.

    1984-01-01

    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  12. Design and Evaluation of the Hamal Parallel Computer

    E-print Network

    Grossman, J.P.

    2002-12-05

    Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new ...

  13. Design and evaluation of the Hamal parallel computer

    E-print Network

    Grossman, J. P., 1973-

    2003-01-01

    Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new ...

  14. ELIPS: Toward a Sensor Fusion Processor on a Chip

    NASA Technical Reports Server (NTRS)

    Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James

    1998-01-01

    The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor data in compact, low-power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems, are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and demonstrated microsecond processing times.

  15. ELIPS: toward a sensor fusion processor on a chip

    NASA Astrophysics Data System (ADS)

    Daud, Taher; Stoica, Adrian; Thomas, Tyson; Li, Wei-te; Fabunmi, James A.

    1999-03-01

    The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor data in compact, low-power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems, are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an 'intelligent' processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data, three important ELIPS building blocks have been fabricated in analog VLSI hardware and demonstrated microsecond processing times.

  16. A parallel scheduler for block iterative solvers in heterogeneous computing environments

    SciTech Connect

    Arioli, M.; Drummond, A.; Ruiz, D.

    1995-12-01

    We present a parallel scheduler for distributing work to a group of processors in a heterogeneous computing environment. Some of the processors in the heterogeneous computing environment can be clustered to take advantage of particular communication networks. Here, the scheduler has been used in the implementation of a parallel block iterative solver based on the Cimmino method. We have used PVM 3 to implement the communication between the heterogeneous processors.

  17. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    E-print Network

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  18. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    E-print Network

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed specifically to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  19. Parallel MR Imaging

    PubMed Central

    Deshmane, Anagha; Gulani, Vikas; Griswold, Mark A.; Seiberlich, Nicole

    2015-01-01

    Parallel imaging is a robust method for accelerating the acquisition of magnetic resonance imaging (MRI) data, and has made possible many new applications of MR imaging. Parallel imaging works by acquiring a reduced amount of k-space data with an array of receiver coils. These undersampled data can be acquired more quickly, but the undersampling leads to aliased images. One of several parallel imaging algorithms can then be used to reconstruct artifact-free images from either the aliased images (SENSE-type reconstruction) or from the under-sampled data (GRAPPA-type reconstruction). The advantages of parallel imaging in a clinical setting include faster image acquisition, which can be used, for instance, to shorten breath-hold times resulting in fewer motion-corrupted examinations. In this article the basic concepts behind parallel imaging are introduced. The relationship between undersampling and aliasing is discussed and two commonly used parallel imaging methods, SENSE and GRAPPA, are explained in detail. Examples of artifacts arising from parallel imaging are shown and ways to detect and mitigate these artifacts are described. Finally, several current applications of parallel imaging are presented and recent advancements and promising research in parallel imaging are briefly reviewed. PMID:22696125

  20. Parallel contingency statistics with Titan.

    SciTech Connect

    Thompson, David C.; Pebay, Philippe Pierre

    2009-09-01

    This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09], which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engine is illustrated by means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevents this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and studies performance with up to 200 processors.

  1. Parallel numerical reservoir simulation: A feasibility study

    SciTech Connect

    Michielse, P.H.

    1994-12-31

    This paper discusses a feasibility study to implement a parallel reservoir simulator on parallel computers. The basis of this study is a reservoir simulator that models an injection-production mechanism. The simulator implements a multigrid solver for the elliptic part of the equations, and uses adaptive local grid refinement to track moving fronts in the reservoir. The parallelization method is based on a domain decomposition method, which assigns the subdomains to the processors. In order to obtain a correct solution, communication across the internal boundaries between the subdomains is required. The implementation of the multigrid method imposes restrictions on the domain decomposition. Furthermore, the adaptive local grid refinement may cause the work load distribution over the processors to be out of balance. Hence, some load balancing technique is required to ensure parallel efficiency. This parallel efficiency is illustrated by experiments on a Convex MetaSeries system.

  2. Parallel hypergraph partitioning for scientific computing.

    SciTech Connect

    Heaphy, Robert; Devine, Karen Dragon; Catalyurek, Umit; Bisseling, Robert; Hendrickson, Bruce Alan; Boman, Erik Gunnar

    2005-07-01

    Graph partitioning is often used for load balancing in parallel computing, but it is known that hypergraph partitioning has several advantages. First, hypergraphs more accurately model communication volume, and second, they are more expressive and can better represent nonsymmetric problems. Hypergraph partitioning is particularly suited to parallel sparse matrix-vector multiplication, a common kernel in scientific computing. We present a parallel software package for hypergraph (and sparse matrix) partitioning developed at Sandia National Labs. The algorithm is a variation on multilevel partitioning. Our parallel implementation is novel in that it uses a two-dimensional data distribution among processors. We present empirical results that show our parallel implementation achieves good speedup on several large problems (up to 33 million nonzeros) with up to 64 processors on a Linux cluster.

  3. Parallel Network Simulations with NEURON

    PubMed Central

    Migliore, M.; Cannia, C.; Lytton, W.W; Markram, Henry; Hines, M. L.

    2009-01-01

    The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored. PMID:16732488

  4. An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors

    NASA Technical Reports Server (NTRS)

    Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

    2015-01-01

    This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.

  5. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C. (Alta Loma, CA); Chow, Edward T. (San Dimas, CA); Waterman, Michael S. (Culver City, CA); Hunkapillar, Timothy J. (Pasadena, CA)

    1997-01-01

    A sequence information signal processing integrated circuit chip designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
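
    For reference, the recurrence such a chip accelerates is the standard Smith-Waterman local-alignment score; a software sketch follows (scoring constants are illustrative, and the chip's systolic pipelining and per-sequence weight functions are omitted):

```c
#include <stdio.h>
#include <string.h>

#define MATCH    2
#define MISMATCH (-1)
#define GAP      (-1)
#define MAXLEN   63

static short max4(short a, short b, short c, short d) {
    short m = a > b ? a : b;
    if (c > m) m = c;
    return d > m ? d : m;
}

/* Best local alignment score between sequences a and b, using 16-bit
 * scores to mirror the chip's two's complement precision. */
short smith_waterman(const char *a, const char *b) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    static short H[MAXLEN + 1][MAXLEN + 1];   /* zero-initialized borders */
    short best = 0;
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            short s = (a[i-1] == b[j-1]) ? MATCH : MISMATCH;
            H[i][j] = max4(0, H[i-1][j-1] + s, H[i-1][j] + GAP, H[i][j-1] + GAP);
            if (H[i][j] > best) best = H[i][j];
        }
    return best;
}

int main(void) {
    printf("best local score: %d\n", smith_waterman("ACACACTA", "AGCACACA"));
    return 0;
}
```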

  6. Automated anomaly detection processor

    NASA Astrophysics Data System (ADS)

    Kraiman, James B.; Arouh, Scott L.; Webb, Michael L.

    2002-07-01

    Robust exploitation of tracking and surveillance data will provide an early warning and cueing capability for military and civilian Law Enforcement Agency operations. This will improve dynamic tasking of limited resources and hence operational efficiency. The challenge is to rapidly identify threat activity within a huge background of noncombatant traffic. We discuss development of an Automated Anomaly Detection Processor (AADP) that exploits multi-INT, multi-sensor tracking and surveillance data to rapidly identify and characterize events and/or objects of military interest, without requiring operators to specify threat behaviors or templates. The AADP has successfully detected an anomaly in traffic patterns in Los Angeles, analyzed ship track data collected during a Fleet Battle Experiment to detect simulated mine laying behavior amongst maritime noncombatants, and is currently under development for surface vessel tracking within the Coast Guard's Vessel Traffic Service to support port security, ship inspection, and harbor traffic control missions, and to monitor medical surveillance databases for early alert of a bioterrorist attack. The AADP can also be integrated into combat simulations to enhance model fidelity of multi-sensor fusion effects in military operations.

  7. Customising compilers for customisable processors 

    E-print Network

    Murray, Alastair Colin

    2012-11-29

    The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements ...

  8. Flexible MIPS soft processor architecture

    E-print Network

    Carli, Roberto

    2008-01-01

    The flexible MIPS soft processor architecture borrows selected technologies from high-performance computing to deliver a modular, highly customizable CPU targeted towards FPGA implementations for embedded systems; the ...

  9. Flexible MIPS Soft Processor Architecture

    E-print Network

    Carli, Roberto

    2008-06-16

    The flexible MIPS soft processor architecture borrows selected technologies from high-performance computing to deliver a modular, highly customizable CPU targeted towards FPGA implementations for embedded systems; the ...

  10. Fully automatic telemetry data processor

    NASA Technical Reports Server (NTRS)

    Cox, F. B.; Keipert, F. A.; Lee, R. C.

    1968-01-01

    Satellite Telemetry Automatic Reduction System /STARS 2/, a fully automatic computer-controlled telemetry data processor, maximizes data recovery, reduces turnaround time, increases flexibility, and improves operational efficiency. The system incorporates a CDC 3200 computer as its central element.

  11. Parallel time integration software

    Energy Science and Technology Software Center (ESTSC)

    2014-07-01

    This package implements an optimal-scaling multigrid solver for the (non)linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integration techniques is limited to spatial parallelism. However, current trends in computer architectures are leading towards systems with more, but not faster, processors. Therefore, faster compute speeds must come from greater parallelism. One approach to achieve parallelism in time is with multigrid, but extending classical multigrid methods for elliptic operators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three spatial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.

  12. Progress in parallelizing XOOPIC

    SciTech Connect

    Mardahl, P.J.; Verboncoeur, J.P.

    1998-12-31

    XOOPIC (Object Oriented Particle in Cell code for X11-based Unix workstations) is presently a serial 2d 3v particle-in-cell plasma simulation. The present effort focuses on using parallel and distributed processing to optimize the simulation for large problems. The benefits include increased capacity for memory-intensive problems, and improved performance for processor-intensive problems. The MPI library enables the parallel version to be easily ported to massively parallel, SMP, and distributed computers. The philosophy employed here is to spatially decompose the system into computational regions separated by virtual boundaries, objects which contain the local data and algorithms to perform the local field solve and particle communication between regions. This implementation will reduce the changes required in the rest of the program by parallelization. Specific implementation details such as the hiding of communication latency behind local computation will also be discussed. The initial implementation includes manual partitioning in one spatial coordinate, electromagnetic models, diagnostics by computational region, and effective transmission of both fields and particles across virtual boundaries. This version was able to perform greater than 600,000 particle-pushes-per-second using eight 200 MHz UltraSPARC CPUs. In this work the authors extend parallel XOOPIC to have 2-d partitioning, automated partitioning, and global diagnostics.
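
    In its simplest 1-D form, the virtual-boundary idea reduces to a ghost-cell exchange between neighboring regions; a minimal MPI sketch (illustrative, not XOOPIC code):

```c
#include <mpi.h>

#define NLOCAL 128

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double f[NLOCAL + 2];                     /* f[0], f[NLOCAL+1] are ghosts */
    for (int i = 0; i <= NLOCAL + 1; i++) f[i] = (double)rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send my edge cells across the virtual boundary; receive the neighbors'
     * edges into my ghost cells. A local field solve on f[1..NLOCAL] follows. */
    MPI_Sendrecv(&f[1], 1, MPI_DOUBLE, left, 0,
                 &f[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[NLOCAL], 1, MPI_DOUBLE, right, 1,
                 &f[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```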

  13. Parallel time integration software

    SciTech Connect

    2014-07-01

    This package implements an optimal-scaling multigrid solver for the (non)linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integration techniques is limited to spatial parallelism. However, current trends in computer architectures are leading towards systems with more, but not faster, processors. Therefore, faster compute speeds must come from greater parallelism. One approach to achieve parallelism in time is with multigrid, but extending classical multigrid methods for elliptic operators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three spatial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.

  14. Case for a field-programmable gate array multicore hybrid machine for an image-processing application

    NASA Astrophysics Data System (ADS)

    Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos

    2011-01-01

    General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.

  15. The use of a parallel virtual machine (PVM) for finite-difference wave simulations

    NASA Astrophysics Data System (ADS)

    Niccanna, Clodagh; Bean, Christopher J.

    1997-08-01

    Computer modelling is now applied routinely throughout the geosciences in an attempt to create synthetic data for comparison with real data. At present, in seismology, there is no analytical solution to the wave equation which allows wave simulations in "geologically realistic" (complex) media. Consequently, computationally expensive numerical solutions are required. Using a finite-difference solution to the wave equation provides a suitable means of modelling seismic waves in a heterogeneous medium. However, when applying this method the grid sizes and the number of time steps required (to ensure numerical stability and sufficiently long wave propagation distances) are limited because of their demand on computer time and memory. Supercomputers represent an obvious solution to these limitations. This paper presents an alternative which is inexpensive, convenient and portable. By clustering a set of processors, for example PCs or workstations, a parallel configuration can be obtained by using the processors available on each machine to perform sections of the calculations simultaneously. By using Parallel Virtual Machine (PVM) — a public domain software package which allows a programmer to create and access a concurrent computing system made from networks of loosely coupled processing elements (Geist and others, 1994) — we have reduced wall-clock times and increased array sizes for a finite-difference solution to the acoustic, elastic and viscoelastic wave equations. In this paper we present methods of parallelizing a serial code and load-balancing this parallelized code. A comparison of serial and parallel wall-clock times, a comparison of wall-clock times on a variety of clusters of machines and the role of communication in this application are presented for a finite-difference solution to the acoustic wave equation.

  16. Fast 3-D prestack depth migration with a parallel PSPI algorithm. Final report

    SciTech Connect

    Roberts, P.M.; Alde, D.M.; House, L.S.

    1997-06-01

    This project addressed the need for general expertise in porting serial seismic reflection codes to a parallel processing environment. It was a continuation of Task Order 38, involving the improvement of existing parallel models developed for that task and support in porting other similar seismic codes to a massively parallel processor environment.

  17. Parallel matrix transpose algorithms on distributed memory concurrent computers

    SciTech Connect

    Choi, J.; Walker, D.W.; Dongarra, J.J.

    1993-10-01

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A{center_dot}B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T}{center_dot}B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
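
    The ownership rule behind these schedules is easy to state in code; a sketch (illustrative) of where a block and its transpose live under a block-scattered distribution, and of the LCM/GCD step count:

```c
#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    int P = 4, Q = 6;                   /* processor template dimensions */
    int g = gcd(P, Q), lcm = P * Q / g;
    printf("GCD = %d, LCM = %d: transpose completes in LCM/GCD = %d steps\n",
           g, lcm, lcm / g);

    /* Block (i,j) lives on processor (i mod P, j mod Q); after the
     * transpose, block (j,i) lives on (j mod P, i mod Q). */
    int i = 5, j = 2;
    printf("block (%d,%d): owner (%d,%d) -> transpose owner (%d,%d)\n",
           i, j, i % P, j % Q, j % P, i % Q);
    return 0;
}
```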

  18. Parallel Processing of Broad-Band PPM Signals

    NASA Technical Reports Server (NTRS)

    Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

    2010-01-01

    A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively-low speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).
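
    The front-end split is, at its core, a demultiplexing of the sample stream; a sketch that omits the filtering stage, with M playing the role of the rate-reduction parameter:

```c
#include <stdio.h>

#define M 4                     /* number of parallel streams   */
#define N 16                    /* broadband samples per frame  */

int main(void) {
    double x[N], streams[M][N / M];
    for (int n = 0; n < N; n++) x[n] = (double)n;   /* stand-in samples */

    /* Stream m receives x[m], x[m+M], x[m+2M], ..., so each stream runs
     * at 1/M of the input rate, within reach of slower CMOS logic. */
    for (int n = 0; n < N; n++)
        streams[n % M][n / M] = x[n];

    for (int m = 0; m < M; m++)
        printf("stream %d holds samples x[%d], x[%d], ...\n", m, m, m + M);
    return 0;
}
```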

  19. EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS

    SciTech Connect

    F. PETRINI; W. FENG

    1999-09-01

    We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

  20. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    NASA Astrophysics Data System (ADS)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the development of international wireless communication standards, there is an increase in the computational requirements for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to their high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architectures in SDR platforms. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit- and cycle-accurate model of programmable SIMD SDR processors in a machine description language, LISA. LISA is a language for instruction-set architecture description that enables rapid modeling at the architectural level. In order to evaluate the suitability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication, have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved a maximum of 47.1% performance boost relative to the opponent processor.

  1. Photovoltaic cell array

    NASA Technical Reports Server (NTRS)

    Eliason, J. T. (inventor)

    1976-01-01

    A photovoltaic cell array consisting of parallel columns of silicon filaments is described. Each filament is doped to produce an inner region of one polarity type and an outer region of an opposite polarity type to thereby form a continuous radial semiconductor junction. Spaced rows of electrical contacts alternately connect to the inner and outer regions to provide a plurality of electrical outputs which may be combined in parallel or in series.

  2. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-12-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processes. User programs and their gangs of processes are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory.

  3. Instrumentation for the development of parallel programs

    SciTech Connect

    Guarna, V.; Malony, A.

    1987-01-01

    Consider a parallel program composed of multiple tasks. These tasks execute independently except when they must synchronize to satisfy dependencies that exist in the program. The problem of instrumenting a multiple processor system for developing parallel programs is the need to identify, monitor and measure the activities of every program task, and to correlate this data with all program tasks. This need is present in all stages of the parallel program development cycle from debugging to performance analysis to optimization. This paper discusses the critical instrumentation issues for the development of parallel programs in the areas of performance analysis and debugging.

  4. Computing contingency statistics in parallel.

    SciTech Connect

    Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre

    2010-09-01

    Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
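
    For contrast, the fixed-size combine that makes moment-based statistics embarrassingly parallel looks like this (the standard pairwise count/mean/M2 update; the point is that the message size is independent of the data size, a property contingency tables lack):

```c
#include <stdio.h>

typedef struct { long n; double mean, M2; } Moments;

/* Combine two partial summaries into one (numerically robust pairwise
 * update); M2 is the sum of squared deviations from the mean. */
Moments combine(Moments a, Moments b) {
    Moments c;
    double delta = b.mean - a.mean;
    c.n    = a.n + b.n;
    c.mean = a.mean + delta * (double)b.n / (double)c.n;
    c.M2   = a.M2 + b.M2 + delta * delta * (double)a.n * (double)b.n / (double)c.n;
    return c;
}

int main(void) {
    Moments a = {4, 2.5, 5.0}, b = {6, 4.0, 8.0};   /* hypothetical partials */
    Moments c = combine(a, b);
    printf("n = %ld, mean = %g, sample variance = %g\n",
           c.n, c.mean, c.M2 / (double)(c.n - 1));
    return 0;
}
```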

  5. Data communications in a parallel active messaging interface of a parallel computer

    DOEpatents

    Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

    2013-11-12

    Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call site statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.

  6. Dynamic processor allocation for adaptively parallel work-stealing jobs

    E-print Network

    Sen, Siddhartha, 1981-

    2004-01-01

    TCP's burstiness is usually regarded as harmful, or at best, inconvenient. Instead, this thesis suggests a new perspective and examines whether TCP's burstiness is useful for certain applications. It claims that burstiness ...

  7. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1987-01-01

    An explicit-explicit subcycling procedure for the finite element analysis of structural dynamics is developed. This procedure has relaxed the usual constraint of requiring integer time step ratios for adjacent nodal groups. This allows for greater advantage to be taken of local stability criteria, and thus improves the efficiency of the explicit time integrator. Example problems are included to demonstrate the accuracy and stability of the method.

  8. Parallel Catastrophe Modelling on a Cell Processor Frank Dehne1

    E-print Network

    Rau-Chaplin, Andrew

    catastrophe risk in many regions and perils all over the world. They are key elements of risk management of specific properties and their residents to perils including hurricanes, earthquakes, severe thunderstorms

  9. Software orchestration of instruction level parallelism on tiled processor architectures

    E-print Network

    Lee, Walter (Walter Cheng-Wan)

    2005-01-01

    Projection from silicon technology is that while transistor budget will continue to blossom according to Moore's law, latency from global wires will severely limit the ability to scale centralized structures at high ...

  10. Exploiting first-class arrays in Fortran for accelerator programming

    SciTech Connect

    Rasmussen, Craig E; Weseloh, Wayne N; Robey, Robert W; Matthew, Sottile J; Quinlan, Daniel; Overbye, Jeffrey

    2010-12-15

    Emerging architectures for high performance computing often are well suited to a data parallel programming model. This paper presents a simple programming methodology based on existing languages and compiler tools that allows programmers to take advantage of these systems. We will work with the array features of Fortran 90 to show how this infrequently exploited, standardized language feature is easily transformed to lower level accelerator code. Our transformations are based on a mapping from Fortran 90 to C++ code with OpenCL extensions. The sheer complexity of programming for clusters of many- or multi-core processors with tens of millions of threads of execution makes the simplicity of the data parallel model attractive. Furthermore, the increasing complexity of today's applications (especially when convolved with the increasing complexity of the hardware) and the need for portability across hardware architectures make a higher-level and simpler programming model like data parallel attractive. The goal of this work has been to exploit source-to-source transformations that allow programmers to develop and maintain programs at a high level of abstraction, without coding to a specific hardware architecture. Furthermore, these transformations allow multiple hardware architectures to be targeted without changing the high-level source. It also removes the necessity for application programmers to understand details of the accelerator architecture or to know OpenCL.

  11. Progress in parallelizing XOOPIC

    NASA Astrophysics Data System (ADS)

    Mardahl, Peter; Verboncoeur, J. P.

    1997-11-01

    XOOPIC (Object Oriented Particle in Cell code for X11-based Unix workstations) is presently a serial 2-D 3v particle-in-cell plasma simulation (J.P. Verboncoeur, A.B. Langdon, and N.T. Gladd, ``An object-oriented electromagnetic PIC code.'' Computer Physics Communications 87 (1995) 199-211.). The present effort focuses on using parallel and distributed processing to optimize the simulation for large problems. The benefits include increased capacity for memory intensive problems, and improved performance for processor-intensive problems. The MPI library is used to enable the parallel version to be easily ported to massively parallel, SMP, and distributed computers. The philosophy employed here is to spatially decompose the system into computational regions separated by 'virtual boundaries', objects which contain the local data and algorithms to perform the local field solve and particle communication between regions. This implementation will reduce the changes required in the rest of the program by parallelization. Specific implementation details such as the hiding of communication latency behind local computation will also be discussed.

  12. The hypercluster: A parallel processing test-bed architecture for computational mechanics applications

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1987-01-01

    The development of numerical methods and software tools for parallel processors can be aided through the use of a hardware test-bed. The test-bed architecture must be flexible enough to support investigations into architecture-algorithm interactions. One way to implement a test-bed is to use a commercial parallel processor. Unfortunately, most commercial parallel processors are fixed in their interconnection and/or processor architecture. In this paper, we describe a modified n-cube architecture, called the hypercluster, which is a superset of many other processor and interconnection architectures. The hypercluster is intended to support research into parallel processing of computational fluid and structural mechanics problems which may require a number of different architectural configurations. An example of how a typical partial differential equation solution algorithm maps onto the hypercluster is given.

  13. SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008

    SciTech Connect

    2008-09-08

    The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

  14. Parallel matrix multiplication on the Connection Machine

    NASA Technical Reports Server (NTRS)

    Tichy, Walter F.

    1988-01-01

    Matrix multiplication is a computation and communication intensive problem. Six parallel algorithms for matrix multiplication on the Connection Machine are presented and compared with respect to their performance and processor usage. For n by n matrices, the algorithms have theoretical running times of O(n^2 log n), O(n log n), O(n), and O(log n), and require n, n^2, n^2, and n^3 processors, respectively. With careful attention to communication patterns, the theoretically predicted runtimes can indeed be achieved in practice. The parallel algorithms illustrate the tradeoffs between performance, communication cost, and processor usage.
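
    A quick work check (processors times running time, compared against the sequential O(n^3) bound) makes the tradeoff explicit; only the O(n)-time variant on n^2 processors is work-optimal:

        \[
        \begin{aligned}
        n \cdot O(n^{2}\log n) &= O(n^{3}\log n), &
        n^{2} \cdot O(n\log n) &= O(n^{3}\log n),\\
        n^{2} \cdot O(n) &= O(n^{3}), &
        n^{3} \cdot O(\log n) &= O(n^{3}\log n).
        \end{aligned}
        \]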

  15. Parallel algorithms for boundary value problems

    NASA Technical Reports Server (NTRS)

    Lin, Avi

    1990-01-01

    A general approach to solving boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step, where all P available processors work in parallel, and the global step, where one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, the suggested approach is very flexible, especially in the local step, so the algorithm can be used with any number of processors and with any SIMD or MIMD machine. Secondly, the communication complexity is very small, so the approach can be used just as easily with shared memory machines. Several examples of using this strategy are discussed.
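
    For the global step, any sequential tridiagonal solver will do; a minimal sketch using the Thomas algorithm (the abstract does not prescribe a particular solver, and the names here are illustrative) is:

        /* Solve a tridiagonal system of order P in O(P) time.
         * a: sub-diagonal (a[0] unused), b: diagonal (overwritten),
         * c: super-diagonal (c[P-1] unused), d: right-hand side
         * (overwritten); the solution is returned in x. */
        void solve_tridiag(int P, const double *a, double *b,
                           const double *c, double *d, double *x)
        {
            for (int i = 1; i < P; ++i) {       /* forward elimination */
                double m = a[i] / b[i - 1];
                b[i] -= m * c[i - 1];
                d[i] -= m * d[i - 1];
            }
            x[P - 1] = d[P - 1] / b[P - 1];
            for (int i = P - 2; i >= 0; --i)    /* back substitution */
                x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
        }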

  16. Parallel Performance of a Combustion Chemistry Simulation

    DOE PAGESBeta

    Skinner, Gregg; Eigenmann, Rudolf

    1995-01-01

    We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.

  17. Parallel Data Mining for Association Rules

    E-print Network

    Zaki, Mohammed Javeed

    Parallel Data Mining for Association Rules on Shared-memory Multi-processors. M. J. Zaki, M... Many algorithms have been proposed for data mining of association rules. However, research so far has mainly... In this paper we will concentrate on data mining for association rules. The problem of mining association rules...

  18. Graph Partitioning Models for Parallel Computing

    SciTech Connect

    Hendrickson, B.; Kolda, T.G.

    1999-03-02

    Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
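
    The standard metric in question is the edge cut; a hedged sketch of computing it for a graph in CSR form (the xadj/adjncy layout used by partitioners such as METIS; part[v] holds v's assigned processor) follows:

        /* Count edges whose endpoints fall in different parts.  Each
         * undirected edge appears twice in CSR, hence the final halving. */
        long edge_cut(int nv, const int *xadj, const int *adjncy,
                      const int *part)
        {
            long cut = 0;
            for (int v = 0; v < nv; ++v)
                for (int e = xadj[v]; e < xadj[v + 1]; ++e)
                    if (part[v] != part[adjncy[e]])
                        cut++;
            return cut / 2;
        }

    The survey's criticism is that this count only approximates true communication volume: a vertex whose off-part neighbors all sit on the same remote processor is communicated once, not once per cut edge.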

  19. An 81.6 µW FastICA processor for epileptic seizure detection.

    PubMed

    Yang, Chia-Hsiang; Shih, Yi-Hsin; Chiueh, Herming

    2015-02-01

    To improve the performance of epileptic seizure detection, independent component analysis (ICA) is applied to multi-channel signals to separate artifacts and signals of interest. FastICA is an efficient algorithm to compute ICA. To reduce the energy dissipation, eigenvalue decomposition (EVD) is utilized in the preprocessing stage to reduce the convergence time of the iterative calculation of ICA components. EVD is computed efficiently through an array structure of processing elements running in parallel. An area-efficient EVD architecture is realized by leveraging the approximate Jacobi algorithm, leading to a 77.2% area reduction. By choosing a proper memory element and a reduced wordlength, the power and area of the storage memory are reduced by 95.6% and 51.7%, respectively. The chip area is minimized through fixed-point implementation and architectural transformations. Given a latency constraint of 0.1 s, an 86.5% area reduction is achieved compared to the direct-mapped architecture. Fabricated in 90 nm CMOS, the core area of the chip is 0.40 mm². The FastICA processor, part of an integrated epileptic control SoC, dissipates 81.6 µW at 0.32 V. The computation delay for a frame of 256 samples over 8 channels is 84.2 ms. Compared to prior work, 0.5% power dissipation, 26.7% silicon area, and a 3.4x computation speedup are achieved. The performance of the chip was verified on a human dataset. PMID:24968296
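
    The EVD stage is built around Jacobi rotations; below is a hedged floating-point sketch of one classical rotation zeroing A[p][q] of a symmetric matrix (the chip uses an approximate, fixed-point variant distributed over an array of processing elements):

        #include <math.h>

        /* Apply one two-sided Jacobi rotation A <- J^T A J with the angle
         * chosen to zero the off-diagonal pair A[p][q], A[q][p]. */
        void jacobi_rotate(int n, double A[n][n], int p, int q)
        {
            if (A[p][q] == 0.0) return;
            double th = 0.5 * atan2(2.0 * A[p][q], A[q][q] - A[p][p]);
            double c = cos(th), s = sin(th);
            for (int k = 0; k < n; ++k) {       /* update columns p and q */
                double akp = A[k][p], akq = A[k][q];
                A[k][p] = c * akp - s * akq;
                A[k][q] = s * akp + c * akq;
            }
            for (int k = 0; k < n; ++k) {       /* update rows p and q */
                double apk = A[p][k], aqk = A[q][k];
                A[p][k] = c * apk - s * aqk;
                A[q][k] = s * apk + c * aqk;
            }
        }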

  20. Parallel community climate model: Description and user's guide

    SciTech Connect

    Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H.

    1996-07-15

    This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.

  1. LAPACK Working Note #216: A novel parallel QR algorithm

    E-print Network

    Kågström, Bo

    LAPACK Working Note #216: A novel parallel QR algorithm for hybrid distributed memory HPC systems... the codes to distributed memory platforms with multithreaded nodes, such as multicore processors. Numerous... early deflation, parallel algorithms, hybrid distributed memory systems. 1. Introduction: Computing...

  2. Parallel optimization algorithms and their implementation in VLSI design

    NASA Technical Reports Server (NTRS)

    Lee, G.; Feeley, J. J.

    1991-01-01

    Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.

  3. Formally Defining and Verifying Master/Slave Speculative Parallelization

    E-print Network

    Zilles, Craig

    Formally Defining and Verifying Master/Slave Speculative Parallelization. Pierre Salverda... Abstract: Master/Slave Speculative Parallelization (MSSP) is a new execution paradigm that decouples... independently and concurrently on slower, but correct, slave processors. This work reports on the first steps...

  4. 42 PARALLEL ALGORITHMS IN GEOMETRY Michael T. Goodrich

    E-print Network

    Goodrich, Michael T.

    42. PARALLEL ALGORITHMS IN GEOMETRY. Michael T. Goodrich. INTRODUCTION: The goal of parallel algorithm... As an application of this technique, consider the problem of constructing the upper convex hull of a set S of n... each and recursively construct the upper convex hull of the points in each list. Assign a processor...

  5. PARALLEL COMPUTER SIMULATION TECHNIQUES FOR THE STUDY OF MACROMOLECULES

    E-print Network

    Wilson, Mark R.

    PARALLEL COMPUTER SIMULATION TECHNIQUES FOR THE STUDY OF MACROMOLECULES. Mark R. Wilson and Jaroslav... years two important developments in computing have occurred. At the high-cost end of the scale, supercomputers have become parallel computers. The ultra-fast (specialist) processors and the expensive vector computers...

  6. Object Oriented Parallel Computation for Plasma Charles D. Norton

    E-print Network

    Bystroff, Chris

    Object Oriented Parallel Computation for Plasma Simulation. Charles D. Norton, Department of Computer Science, Rensselaer Polytechnic Institute, Troy NY, USA, and Boleslaw K. Szymanski, Department of Computer... Categories: ...Organization]: Multiple Data Stream Architectures -- parallel processors; D.1.5 [Software]: Programming...

  7. Butterfly Project Report Large-Scale Parallel Programming

    E-print Network

    Scott, Michael L.

    Butterfly Project Report 22: Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor. Thomas J. LeBlanc, Michael L. Scott, and Christopher M. Brown, Department... in the world. In the course of our work with the Butterfly we have ported three compilers, developed five major...

  8. CFD Solvers on Many-core Processors

    E-print Network

    Brandvik, Tobias

    2008-11-11

    Slide excerpts: Tobias Brandvik, Whittle Laboratory. CFD Background -- CFD: Computational Fluid Dynamics; Whittle Laboratory -- Turbomachinery CFD...

  9. 7 CFR 1208.18 - Processor.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE PROCESSED RASPBERRY PROMOTION, RESEARCH, AND INFORMATION ORDER Processed Raspberry Promotion, Research, and Information Order Definitions § 1208.18 Processor. Processor means a person engaged in the preparation of raspberries for...

  10. 7 CFR 1208.18 - Processor.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE PROCESSED RASPBERRY PROMOTION, RESEARCH, AND INFORMATION ORDER Processed Raspberry Promotion, Research, and Information Order Definitions § 1208.18 Processor. Processor means a person engaged in the preparation of raspberries for...

  11. 7 CFR 989.13 - Processor.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ...MARKETING SERVICE (Marketing Agreements and Orders; Fruits, Vegetables, Nuts), DEPARTMENT OF AGRICULTURE RAISINS PRODUCED FROM GRAPES GROWN IN CALIFORNIA Order Regulating Handling Definitions § 989.13 Processor. Processor means any...

  12. 7 CFR 989.13 - Processor.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ...MARKETING SERVICE (MARKETING AGREEMENTS AND ORDERS; FRUITS, VEGETABLES, NUTS), DEPARTMENT OF AGRICULTURE RAISINS PRODUCED FROM GRAPES GROWN IN CALIFORNIA Order Regulating Handling Definitions § 989.13 Processor. Processor means any...

  13. Massive affordable computing using ARM processors in high energy physics

    NASA Astrophysics Data System (ADS)

    Smith, J. W.; Hamilton, A.

    2015-05-01

    High Performance Computing is relevant in many applications around the world, particularly high energy physics. Experiments such as ATLAS, CMS, ALICE and LHCb generate huge amounts of data which need to be stored and analyzed at server farms located on site at CERN and around the world. Apart from the initial cost of setting up an effective server farm, the cost of power consumption and cooling is significant. The proposed solution to reduce costs without losing performance is to make use of the ARM® processors found in nearly all smartphones and tablet computers. Their low power consumption, low cost and respectable processing speed make them an interesting choice for future large scale parallel data processing centers. Benchmarks on the Cortex™-A series of ARM® processors, including the HPL and PMBW suites, will be presented, and preliminary results from the PROOF benchmark in the context of high energy physics will be analyzed.

  14. Dynamically Reconfigurable Optical Morphological Processor

    NASA Technical Reports Server (NTRS)

    Chao, Tien-Hsin

    1996-01-01

    Experimental optical/electronic image-processing system performs morphological processing in optical domain. System operates at high speed. Also dynamically reconfigurable, switched rapidly among all forms of morphological processing. Major advantage over correlator-based optical morphological processors in which morphological operations governed by fixed holographic filters.

  15. A Course on Reconfigurable Processors

    ERIC Educational Resources Information Center

    Shoufan, Abdulhadi; Huss, Sorin A.

    2010-01-01

    Reconfigurable computing is an established field in computer science. Teaching this field to computer science students demands special attention due to limited student experience in electronics and digital system design. This article presents a compact course on reconfigurable processors, which was offered at the Technische Universitat Darmstadt,…

  16. Dual-Sampler Processor Digitizes CCD Output

    NASA Technical Reports Server (NTRS)

    Salomon, P. M.

    1986-01-01

    Circuit for processing output of charge-coupled device (CCD) imager provides increased time for analog-to-digital conversion, thereby reducing bandwidth required for video processing. Instead of one sample-and-hold circuit of conventional processor, improved processor includes two sample-and-hold circuits alternated with each other. Dual-sampler processor operates with lower bandwidth and with timing requirements less stringent than those of single-sample processor.

  17. Processor architecture for airborne SAR systems

    NASA Technical Reports Server (NTRS)

    Glass, C. M.

    1983-01-01

    Digital processors for spaceborne imaging radars and application of the technology developed for airborne SAR systems are considered. Transferring algorithms and implementation techniques from airborne to spaceborne SAR processors offers obvious advantages. The following topics are discussed: (1) a quantification of the differences in processing algorithms for airborne and spaceborne SARs; and (2) an overview of three processors for airborne SAR systems.

  18. EMBEDDED SOPC DESIGN WITH NIOS II PROCESSOR

    E-print Network

    Chu, Pong P.

    ...language) synthesis and FPGA devices and the availability of soft-core processors allow designers... of examples. An Altera FPGA prototyping board and its Nios II soft-core processor are used for this purpose... processor and IP (intellectual property) core based system, the partition and integration of software...

  19. Evaluation of a Java Processor Martin Schoeberl

    E-print Network

    Schoeberl, Martin

    ...440K gates. aJile's JEMCore is a direct-execution Java processor that is available as both an IP core... is an implementation of the Java virtual machine (JVM) in a low-cost FPGA. JOP is the smallest hardware realization... smaller than a comparable RISC processor in an FPGA. Although JOP is intended as a processor for embedded...

  20. Automatic differentiation for design sensitivity analysis of structural systems using multiple processors

    NASA Technical Reports Server (NTRS)

    Nguyen, Duc T.; Storaasli, Olaf O.; Qin, Jiangning; Qamar, Ramzi

    1994-01-01

    An automatic differentiation tool (ADIFOR) is incorporated into a finite element based structural analysis program for shape and non-shape design sensitivity analysis of structural systems. The entire analysis and sensitivity procedures are parallelized and vectorized for high performance computation. Small scale examples to verify the accuracy of the proposed program and a medium scale example to demonstrate the parallel vector performance on multiple CRAY C90 processors are included.
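
    The mechanical rule a tool like ADIFOR applies can be illustrated with a toy forward-mode sketch: every quantity carries a (value, derivative) pair and each operation propagates both. ADIFOR itself transforms Fortran source; this C dual-number type is illustrative only:

        typedef struct { double v, d; } dual;

        static dual dadd(dual a, dual b)    /* sum rule */
        { return (dual){ a.v + b.v, a.d + b.d }; }

        static dual dmul(dual a, dual b)    /* product rule */
        { return (dual){ a.v * b.v, a.d * b.v + a.v * b.d }; }

        /* d/dx of f(x) = x * (x + 3) at x = 2: seed x = {2.0, 1.0},
         * compute f = dmul(x, dadd(x, (dual){3.0, 0.0})), and read
         * f.d, which comes out to 2*2 + 3 = 7. */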

  1. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  2. Research in Parallel Algorithms and Software for Computational Aerosciences

    NASA Technical Reports Server (NTRS)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics (CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  3. Embedded SoPC design with Nios II processor and VHDL examples

    E-print Network

    Chu, Pong P.

    ...and can be implemented and tested on the board. Some can be used as basic IP cores to be incorporated... into a single FPGA (field-programmable gate array) device. In addition to the customized software... to configure the soft-core processor, create tailored I/O interfaces, and develop specialized hardware...

  4. TMVOC-MP: a parallel numerical simulator for Three-PhaseNon-isothermal Flows of Multicomponent Hydrocarbon Mixtures inporous/fractured media

    SciTech Connect

    Zhang, Keni; Yamamoto, Hajime; Pruess, Karsten

    2008-02-15

    TMVOC-MP is a massively parallel version of the TMVOC code (Pruess and Battistelli, 2002), a numerical simulator for three-phase non-isothermal flow of water, gas, and a multicomponent mixture of volatile organic chemicals (VOCs) in multidimensional heterogeneous porous/fractured media. TMVOC-MP was developed by introducing massively parallel computing techniques into TMVOC. It retains the physical process model of TMVOC, designed for applications to contamination problems that involve hydrocarbon fuels or organic solvents in saturated and unsaturated zones. TMVOC-MP can model contaminant behavior under 'natural' environmental conditions, as well as for engineered systems, such as soil vapor extraction, groundwater pumping, or steam-assisted source remediation. With its sophisticated parallel computing techniques, TMVOC-MP can handle much larger problems than TMVOC, and can be much more computationally efficient. TMVOC-MP models multiphase fluid systems containing variable proportions of water, non-condensible gases (NCGs), and water-soluble volatile organic chemicals (VOCs). The user can specify the number and nature of NCGs and VOCs. There are no intrinsic limitations to the number of NCGs or VOCs, although the arrays for fluid components are currently dimensioned as 20, accommodating water plus 19 components that may be either NCGs or VOCs. Among them, NCG arrays are dimensioned as 10. The user may select NCGs from a data bank provided in the software. The currently available choices include O₂, N₂, CO₂, CH₄, ethane, ethylene, acetylene, and air (a pseudo-component treated with properties averaged from N₂ and O₂). Thermophysical property data of VOCs can be selected from a chemical data bank, included with TMVOC-MP, that provides parameters for 26 commonly encountered chemicals. Users also can input their own data for other fluids. The fluid components may partition (volatilize and/or dissolve) among gas, aqueous, and NAPL phases. Any combination of the three phases may be present, and phases may appear and disappear in the course of a simulation. In addition, VOCs may be adsorbed by the porous medium, and may biodegrade according to a simple half-life model. Detailed discussion of the physical processes, assumptions, and fluid properties used in TMVOC-MP can be found in the TMVOC user's guide (Pruess and Battistelli, 2002). TMVOC-MP was developed based on the parallel framework of the TOUGH2-MP code (Zhang et al. 2001, Wu et al. 2002). It uses MPI (Message Passing Forum, 1994) for parallel implementation. A domain decomposition approach is adopted for the parallelization. The code partitions a simulation domain, defined by an unstructured grid, using a partitioning algorithm from the METIS software package (Karypsis and Kumar, 1998). In parallel simulation, each processor is in charge of one part of the simulation domain for assembling mass and energy balance equations, solving linear equation systems, updating thermophysical properties, and performing other local computations. The local linear-equation systems are solved in parallel by multiple processors with the Aztec linear solver package (Tuminaro et al., 1999). Although each processor solves the linearized equations of its subdomains independently, the entire linear equation system is solved together by all processors collaboratively via communication between neighboring processors during each iteration. Detailed discussion of the prototype of the data-exchange scheme can be found in Elmroth et al. (2001). In addition, FORTRAN 90 features are introduced to TMVOC-MP, such as dynamic memory allocation, array operation, matrix manipulation, and replacing 'common blocks' (used in the original TMVOC) with modules. All new subroutines are written in FORTRAN 90. Program units imported from the original TMVOC remain in standard FORTRAN 77. This report provides a quick starting guide for using the TMVOC-MP program. We suppose that users have basic knowledge of the original TMVOC code. Users can find the detailed technical descrip...

  5. Massively parallel computational fluid dynamics calculations for aerodynamics and aerothermodynamics applications

    SciTech Connect

    Payne, J.L.; Hassan, B.

    1998-09-01

    Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.
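
    For reference, the speedup and parallel efficiency quoted in studies like this are conventionally defined from the single-processor time T_1 and the p-processor time T_p (a standard definition, not taken from the paper itself):

        \[
        S_p = \frac{T_1}{T_p}, \qquad
        E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p}.
        \]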

  6. Programming environment for parallel-vision algorithms. Final technical report, February 1988-December 1989

    SciTech Connect

    Brown, C.

    1990-04-11

    This contract developed and disseminated papers, ideas, algorithms, analysis, software, applications, and implementations for parallel programming environments for computer vision and for vision applications. The work has been widely reported and highly influential. The most significant work centered on the Butterfly Parallel Processor, the MaxVideo pipelined parallel image processor, and the development of the real-time computer vision laboratory. For the Butterfly, the Psyche multi-model operating system was developed and the CONSUL autoparallelizing compiler was designed. Much basic and influential performance monitoring and debugging work was completed, resulting in working systems and novel algorithms. There was also significant research in systems and applications using other parallel architectures in the laboratory, such as the MaxVideo parallel pipelined image processor. The contract developed a heterogeneous parallel architecture involving pipelined and MIMD parallelism and integrated it with a robot head.

  7. Runtime volume visualization for parallel CFD

    NASA Technical Reports Server (NTRS)

    Ma, Kwan-Liu

    1995-01-01

    This paper discusses some aspects of design of a data distributed, massively parallel volume rendering library for runtime visualization of parallel computational fluid dynamics simulations in a message-passing environment. Unlike the traditional scheme in which visualization is a postprocessing step, the rendering is done in place on each node processor. Computational scientists who run large-scale simulations on a massively parallel computer can thus perform interactive monitoring of their simulations. The current library provides an interface to handle volume data on rectilinear grids. The same design principles can be generalized to handle other types of grids. For demonstration, we run a parallel Navier-Stokes solver making use of this rendering library on the Intel Paragon XP/S. The interactive visual response achieved is found to be very useful. Performance studies show that the parallel rendering process is scalable with the size of the simulation as well as with the parallel computer.

  8. Towards Distributed Memory Parallel Program Analysis

    SciTech Connect

    Quinlan, D; Barany, G; Panas, T

    2008-06-17

    This paper presents a parallel attribute evaluation for distributed memory parallel computer architectures, where previously only shared memory parallel support for this technique had been developed. Attribute evaluation is part of how attribute grammars are used for program analysis within modern compilers. Within this work, we have extended ROSE, an open compiler infrastructure, with a distributed memory parallel attribute evaluation mechanism to support user-defined global program analysis required for some forms of security analysis, which cannot be addressed by a file-by-file view of large scale applications. As a result, user-defined security analyses may now run in parallel without the user having to specify the way data is communicated between processors. The automation of communication enables an extensible open-source parallel program analysis infrastructure.

  9. ARCHITECTURE OF ASYNCHRONOUS CELLULAR PROCESSOR ARRAY FOR IMAGE SKELETONIZATION

    E-print Network

    Dudek, Piotr

    ...operations at high speeds. Most of the CPA designs employ an SIMD paradigm and operate in a synchronous mode... to the control signals, which are generated internally. Such a local control strategy significantly extends... i.e. each processing cell is connected to 6 neighbours. Such a processing grid provides several advantages...

  10. Specification and preliminary design of an array processor

    NASA Technical Reports Server (NTRS)

    Slotnick, D. L.; Graham, M. L.

    1975-01-01

    The design of a computer suited to the class of problems typified by the general circulation of the atmosphere was investigated. A fundamental goal was that the resulting machine should have roughly 100 times the computing capability of an IBM 360/95 computer. A second requirement was that the machine should be programmable in a higher level language similar to FORTRAN. Moreover, the new machine would have to be compatible with the IBM 360/95 since the IBM machine would continue to be used for pre- and post-processing. A third constraint was that the cost of the new machine was to be significantly less than that of other extant machines of similar computing capability, such as the ILLIAC IV and CDC STAR. A final constraint was that it should be feasible to fabricate a complete system and put it in operation by early 1978. Although these objectives were generally met, considerable work remains to be done on the routing system.

  11. Parallel fault-tolerant robot control

    NASA Technical Reports Server (NTRS)

    Hamilton, D. L.; Bennett, J. K.; Walker, I. D.

    1992-01-01

    A shared memory multiprocessor architecture is used to develop a parallel fault-tolerant robot controller. Several versions of the robot controller are developed and compared. A robot simulation is also developed for control observation. Comparison of a serial version of the controller and a parallel version without fault tolerance showed the speedup possible with the coarse-grained parallelism currently employed. The performance degradation due to the addition of processor fault tolerance was demonstrated by comparison of these controllers with their fault-tolerant versions. Comparison of the more fault-tolerant controller with the lower-level fault-tolerant controller showed how varying the amount of redundant data affects performance. The results demonstrate the trade-off between speed performance and processor fault tolerance.

  12. A Parallel Rendering Algorithm for MIMD Architectures

    NASA Technical Reports Server (NTRS)

    Crockett, Thomas W.; Orloff, Tobias

    1991-01-01

    Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

  13. Simulating the scheduling of parallel supercomputer applications

    SciTech Connect

    Seager, M.K.; Stichnoth, J.M.

    1989-09-19

    An Event Driven Simulator for Evaluating Multiprocessing Scheduling (EDSEMS) disciplines is presented. The simulator is made up of three components: machine model; parallel workload characterization; and scheduling disciplines for mapping parallel applications (many processes cooperating on the same computation) onto processors. A detailed description of how the simulator is constructed, how to use it, and how to interpret the output is also given. Initial results are presented from the simulation of parallel supercomputer workloads using "Dog-Eat-Dog," "Family," and "Gang" scheduling disciplines. These results indicate that Gang scheduling is far better at giving the number of processors that a job requests than Dog-Eat-Dog or Family scheduling. In addition, the system throughput and turnaround time are not adversely affected by this strategy. 10 refs., 8 figs., 1 tab.

  14. Knowledge representation into Ada parallel processing

    NASA Technical Reports Server (NTRS)

    Masotto, Tom; Babikyan, Carol; Harper, Richard

    1990-01-01

    The Knowledge Representation into Ada Parallel Processing project is a joint NASA and Air Force funded project to demonstrate the execution of intelligent systems in Ada on the Charles Stark Draper Laboratory fault-tolerant parallel processor (FTPP). Two applications were demonstrated - a portion of the adaptive tactical navigator and a real time controller. Both systems are implemented as Activation Framework Objects on the Activation Framework intelligent scheduling mechanism developed by Worcester Polytechnic Institute. The implementations, results of performance analyses showing speedup due to parallelism and initial efficiency improvements are detailed and further areas for performance improvements are suggested.

  15. Language constructs for modular parallel programs

    SciTech Connect

    Foster, I.

    1996-03-01

    We describe programming language constructs that facilitate the application of modular design techniques in parallel programming. These constructs allow us to isolate resource management and processor scheduling decisions from the specification of individual modules, which can themselves encapsulate design decisions concerned with concurrency, communication, process mapping, and data distribution. This approach permits development of libraries of reusable parallel program components and the reuse of these components in different contexts. In particular, alternative mapping strategies can be explored without modifying other aspects of program logic. We describe how these constructs are incorporated in two practical parallel programming languages, PCN and Fortran M. Compilers have been developed for both languages, allowing experimentation in substantial applications.

  16. An optical processor for zero-crossing edge detection

    NASA Technical Reports Server (NTRS)

    Jared, David A.; Johnson, Kristina M.

    1993-01-01

    An optical processor for zero-crossing edge detection is presented, which consists of two defocused imaging systems to perform the Gaussian convolutions and a VLSI, ferroelectric liquid crystal spatial light modulator (SLM) to determine the zero-crossings. The zero-crossing SLM is a 32 x 32 array of pixels located on 100 micron centers. Each pixel contains a phototransistor, an auto-scaling amplifier, a zero-crossing detection circuit, and a liquid crystal modulating pad. Electrical and optical characteristics of the zero-crossing SLM are presented along with experimental results of the system.
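
    A hedged software analogue of the zero-crossing stage may help: the two defocused imaging systems supply the Gaussian blurs optically, so only the sign test remains (array names here are illustrative):

        /* Mark pixels where a difference-of-Gaussians image changes sign
         * against the right or lower neighbor. */
        void zero_crossings(int w, int h, const float *dog,
                            unsigned char *edge)
        {
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x) {
                    int i = y * w + x;
                    int flip = (x + 1 < w && dog[i] * dog[i + 1] < 0.0f) ||
                               (y + 1 < h && dog[i] * dog[i + w] < 0.0f);
                    edge[i] = flip ? 255 : 0;
                }
        }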

  17. Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

    NASA Technical Reports Server (NTRS)

    Fricker, David M.

    1997-01-01

    The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.

  18. A Parallel Algorithm for the Vehicle Routing Problem

    SciTech Connect

    Groer, Christopher S; Golden, Bruce; Edward, Wasil

    2011-01-01

    The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.

  19. FPGA wavelet processor design using language for instruction-set architectures (LISA)

    NASA Astrophysics Data System (ADS)

    Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios

    2007-04-01

    The design of a microprocessor is a long, tedious, and error-prone task consisting of several design phases: architecture exploration, software design (assembler, linker, loader, profiler), architecture implementation (RTL generation for FPGA or cell-based ASIC), and verification. The Language for Instruction-Set Architectures (LISA) allows a microprocessor to be modeled not only from the instruction set but also from an architecture description including pipelining behavior, which allows design and development tool consistency over all levels of the design. To explore the capability of the LISA processor design platform, a.k.a. CoWare Processor Designer, we present in this paper three microprocessor designs that implement an 8/8 wavelet transform processor of the kind typically used in today's FBI fingerprint compression scheme. We have designed a 3-stage pipelined 16-bit RISC processor (NanoBlaze). Although RISC µPs are usually considered "fast" processors due to design concepts like constant instruction word size, deep pipelines, and many general-purpose registers, it turns out that DSP operations consume substantial processing time in a RISC processor. In a second step we used design principles from programmable digital signal processors (PDSPs) to improve the throughput of the DWT processor. A multiply-accumulate operation along with an indirect addressing operation were the key to achieving higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers, and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the substantial improvement a TVP offers compared with traditional RISC or PDSP designs.
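
    The kernel that separates the three designs is the filter inner product; a minimal C sketch shows where a dedicated MAC (and, further, a vector multiply) pays off (tap counts and types are illustrative, not from the paper):

        /* One wavelet filter tap contributes one multiply-accumulate.
         * A plain RISC spends several instructions per iteration; a PDSP
         * retires one MAC per cycle; a vector processor with embedded
         * FPGA multipliers can evaluate all taps at once. */
        int fir_dot(const short *x, const short *coef, int taps)
        {
            int acc = 0;
            for (int t = 0; t < taps; ++t)
                acc += x[t] * coef[t];
            return acc;
        }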

  20. Hash based parallel algorithms for mining association rules

    SciTech Connect

    Shintani, Takahiko; Kitsuregawa, Masaru

    1996-12-31

    In this paper, we propose four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules on shared-nothing parallel machines to improve performance. In NPA, candidate itemsets are just copied amongst all the processors, which can lead to memory overflow for large transaction databases. The remaining three algorithms partition the candidate itemsets over the processors. If the itemsets are partitioned simply (SPA), transaction data has to be broadcast to all processors. HPA partitions the candidate itemsets using a hash function to eliminate broadcasting, which also reduces the comparison workload significantly. HPA-ELD fully utilizes the available memory space by detecting the extremely large itemsets and copying them, which is also very effective at flattening the load over the processors. We implemented these algorithms in a shared-nothing environment. Performance evaluations show that the best algorithm, HPA-ELD, attains good linearity on speedup ratio and is effective for handling skew.
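
    HPA's elimination of broadcast rests on a shared hash; a hedged sketch follows (the hash choice and names are illustrative; the paper does not specify this function):

        /* Map a k-item candidate itemset to the one processor responsible
         * for counting it.  Every processor applies the same hash to the
         * candidates it generates from local transactions, so support
         * counts for an itemset accumulate in exactly one place. */
        unsigned owner_of(const int *itemset, int k, unsigned nprocs)
        {
            unsigned h = 2166136261u;           /* FNV-1a over item ids */
            for (int i = 0; i < k; ++i) {
                h ^= (unsigned)itemset[i];
                h *= 16777619u;
            }
            return h % nprocs;
        }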