Science.gov

Sample records for parallel processor array

  1. Integration of IR focal plane arrays with massively parallel processor

    NASA Astrophysics Data System (ADS)

    Esfandiari, P.; Koskey, P.; Vaccaro, K.; Buchwald, W.; Clark, F.; Krejca, B.; Rekeczky, C.; Zarandy, A.

    2008-04-01

    The intent of this investigation is to replace the low-fill-factor visible sensor of a Cellular Neural Network (CNN) processor with an InGaAs Focal Plane Array (FPA) using both bump bonding and epitaxial layer transfer techniques for use in the Ballistic Missile Defense System (BMDS) interceptor seekers. The goal is to fabricate a massively parallel digital processor with a local as well as a global interconnect architecture. Currently, this unique CNN processor is capable of processing a target scene in excess of 10,000 frames per second with its visible sensor. What makes the CNN processor so unique is that each processing element includes memory, local data storage, local and global communication devices, and a visible sensor, all supported by a programmable analog or digital computer program.

  2. Digital Parallel Processor Array for Optimum Path Planning

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

    1996-01-01

    The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of a direction along which the first stimulus signal is received, broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.
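
    The cell-level behavior described in this abstract amounts to a wavefront (breadth-first) search: a stimulus wave spreads from the starting cell, each cell latches the direction of its first stimulus, and a master processor traces back from the destination. A minimal serial sketch of that idea — the grid shape, 4-neighbor links, and all names are illustrative, not the patented circuit:

```python
from collections import deque

def plan_path(rows, cols, start, goal, blocked=frozenset()):
    """Wavefront propagation: each cell records which neighbor it first
    heard from; the optimum path is then traced back from the goal."""
    came_from = {start: None}
    wave = deque([start])
    while wave:
        r, c = wave.popleft()
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and nxt not in came_from and nxt not in blocked):
                came_from[nxt] = (r, c)   # latch the first-stimulus direction
                wave.append(nxt)
    path, cell = [], goal                  # master-processor trace-back
    while cell is not None:
        path.append(cell)
        cell = came_from[cell]
    return path[::-1]
```

    With uniform link delays this reproduces a shortest path; the hardware's programmable per-cell delay would generalize it to weighted terrain.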

  3. Parallel processing in a host plus multiple array processor system for radar

    NASA Technical Reports Server (NTRS)

    Barkan, B. Z.

    1983-01-01

    Host plus multiple array processor architecture is demonstrated to yield a modular, fast, and cost-effective system for radar processing. Software methodology for programming such a system is developed. Parallel processing with pipelined data flow among the host, array processors, and discs is implemented. Theoretical analysis of performance is made and experimentally verified. The broad class of problems to which the architecture and methodology can be applied is indicated.

  4. Array processor architecture

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high-speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules, and from there the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors normally operating quite independently of each other in a multiprocessing fashion. For data-dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel-processing fashion on the next instruction. Even when functioning in the parallel-processing mode, however, the processors are not in lockstep but execute their own copy of the program individually unless or until another overall processor-array synchronization instruction is issued.

  5. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  6. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system comprises a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  7. Array processors in chemistry

    SciTech Connect

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  8. Optical systolic array processor using residue arithmetic

    NASA Technical Reports Server (NTRS)

    Jackson, J.; Casasent, D.

    1983-01-01

    The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
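
    Residue notation, as used above, represents each value by its remainders modulo a set of pairwise-coprime moduli, so additions and multiplications proceed independently (and in parallel) per modulus, with no carries between channels. A sketch of a matrix-vector-style accumulation in residue form — the moduli are arbitrary choices, not those of the optical processor:

```python
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime; dynamic range = 7*11*13 = 1001

def to_residue(x):
    return tuple(x % m for m in MODULI)

def dot_residue(a_res, b_res):
    """Dot product carried out channel-by-channel in residue notation."""
    acc = tuple(0 for _ in MODULI)
    for ar, br in zip(a_res, b_res):
        acc = tuple((s + ar[i] * br[i]) % m
                    for i, (s, m) in enumerate(zip(acc, MODULI)))
    return acc

def from_residue(res):
    """Chinese-remainder reconstruction back to an ordinary integer."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

a, b = [3, 5, 2], [4, 1, 6]
result = from_residue(dot_residue([to_residue(x) for x in a],
                                  [to_residue(x) for x in b]))
# 3*4 + 5*1 + 2*6 = 29
```

    Each residue channel stays small (here below 13), which is exactly the reduced-dynamic-range property the optical implementation exploits.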

  9. Parallel Analog-to-Digital Image Processor

    NASA Technical Reports Server (NTRS)

    Lokerson, D. C.

    1987-01-01

    A proposed integrated-circuit network of many identical units converts the analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. The converter is located near the imaging detectors, within the cryogenic detector package. Because the converter output is digital, it lends itself well to multiplexing and to postprocessing for correction of the gain and offset errors peculiar to each picture element and its sampling and conversion circuits. The analog-to-digital image processor is a massively parallel system for processing data from an array of photodetectors. The system is built as a compact integrated circuit located near the focal plane. The buffer amplifier for each picture element has a different offset.

  10. The AIS-5000 parallel processor

    SciTech Connect

    Schmitt, L.A.; Wilson, S.S.

    1988-05-01

    The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has superior cost/performance characteristics to two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In this paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance benchmarks are given for certain simple and complex functions.

  11. Modeling algorithm execution time on processor arrays

    NASA Technical Reports Server (NTRS)

    Adams, L. M.; Crockett, T. W.

    1984-01-01

    An approach to modelling the execution time of algorithms on parallel arrays is presented. This time is expressed as a function of the number of processors and system parameters. The resulting model has been applied to a parallel implementation of the conjugate-gradient algorithm on NASA's FEM. Results of experiments performed to compare the model predictions against actual behavior show that the floating-point arithmetic, communication, and synchronization components of the parallel algorithm execution time were correctly modelled. The results also show that the overhead caused by the interaction of the system software and the actual parallel hardware must be reflected in the model parameters. The model has been used to predict the performance of the conjugate gradient algorithm on a given problem as the number of processors and machine characteristics varied.
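
    A model of the kind validated here expresses execution time as per-processor arithmetic work plus communication and synchronization overhead. A minimal sketch — the parameter names and the simple linear form are illustrative, not the paper's fitted model:

```python
def predicted_time(n_ops, p, t_flop, t_comm_per_step, t_sync, steps):
    """Execution-time model: floating-point work divided across p
    processors, plus per-iteration communication and synchronization
    overhead (all parameters are illustrative)."""
    arithmetic = (n_ops / p) * t_flop
    communication = steps * t_comm_per_step
    synchronization = steps * t_sync
    return arithmetic + communication + synchronization
```

    Fitting t_flop, t_comm_per_step, and t_sync to measurements — and noting where predictions diverge — is how system-software overhead shows up in the model parameters.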

  12. Adapting implicit methods to parallel processors

    SciTech Connect

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D.

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g., larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems, where it is common to take the computational domain and distribute the grid points over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this will result in idle processors during part of the computation, and as the number of idle processors increases, the effective speed improvement from using a parallel processor decreases.
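
    The boundary problem described here is usually attacked with ghost (halo) points: each processor keeps a copy of its neighbor's edge value and refreshes it by message exchange each sweep. A serial sketch of the idea for a 1-D grid split across two hypothetical processors — an explicit Jacobi sweep stands in for the implicit solver, and the ghost refresh stands in for the messages:

```python
import numpy as np

def jacobi_step(u):
    """One Jacobi relaxation sweep; endpoint values stay fixed."""
    v = u.copy()
    v[1:-1] = 0.5 * (u[:-2] + u[2:])
    return v

rng = np.random.default_rng(0)
u = rng.random(8)
serial = jacobi_step(jacobi_step(u))   # two sweeps on the whole domain

# the same two sweeps split across two hypothetical processors, each
# holding one ghost point that mirrors the neighbor's edge value
left, right = u[:5].copy(), u[3:].copy()
for _ in range(2):
    left, right = jacobi_step(left), jacobi_step(right)
    # the message exchange: refresh ghost cells from the neighbor
    left[4], right[0] = right[1], left[3]
parallel = np.concatenate([left[:4], right[1:]])
```

    The split computation with ghost refreshes reproduces the whole-domain result; the idle time discussed above arises while each processor waits for that exchange.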

  13. Ultrafast Fourier-transform parallel processor

    SciTech Connect

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  14. Array processor with multiple broadcasting

    SciTech Connect

    Kumar, V.K.P.; Raghavendra, C.S.

    1987-04-01

    In this paper the authors consider a generalized broadcasting feature for mesh-connected computers (MCCs), consisting of N = N^(1/2) x N^(1/2) processors with a broadcast facility in each row and each column. This multiple broadcast allows parallel data transfers within rows and columns of processors. The proposed architecture is suited to the solution of problems in linear algebra, image processing, computational geometry, and numerical computation. They develop parallel algorithms for many problems in these areas; for example, they can find the maximum in O(N^(1/6)), the median in O(N^(1/6)(log N)^(2/3)), the convex polygon of a digitized picture in O(N^(1/6)), and the nearest neighbor in O(N^(1/6)), while these problems need Omega(N^(1/3)) on a 2-MCC with a single broadcast. The authors also derive bounds on the speedups obtainable with broadcasting.

  15. Ray tracing on a networked processor array

    NASA Astrophysics Data System (ADS)

    Yang, Jungsook; Lee, Seung Eun; Chen, Chunyi; Bagherzadeh, Nader

    2010-10-01

    As computation costs increase to meet design requirements for computation-intensive graphics applications on today's embedded systems, the pressure to develop high-performance parallel processors on a chip will increase. Acceleration of the ray tracing computation has become a major issue as the computer graphics industry's demand for rendering realistic images grows. Network-on-chip (NoC) techniques that interconnect multiple processing elements with routers are the solution for reducing computation time and power consumption by parallel processing on a chip. It is also essential to meet the scalability and complexity challenges for system-on-chip (SoC). In this article, we describe a parallel ray tracing application mapping on a mesh-based multicore NoC architecture. We describe an optimised ray tracing kernel and parallelisation strategies, varying the workload distribution statically and dynamically. In this work, we present results and timing performance of our parallel ray tracing application on a NoC, which are obtained through our cycle-accurate multicore NoC simulator. Using a dynamic scheduling load balancing technique, we achieved a maximum speedup of 35.97 on an 8 × 8 networked processor array using a NoC as the interconnect.
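
    The dynamic scheduling behind such a speedup can be pictured as a shared work queue: idle processing elements pull the next tile of rays instead of receiving a fixed static share. A thread-based sketch — the queue discipline is the point; none of this is the authors' NoC simulator:

```python
import queue
import threading

def render_dynamic(tiles, n_workers, cost):
    """Dynamic load balancing: idle workers pull the next tile from a
    shared queue instead of working through a static split."""
    work = queue.Queue()
    for t in tiles:
        work.put(t)
    done = [0] * n_workers   # simulated busy-time per worker

    def worker(i):
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return
            done[i] += cost(t)   # stand-in for tracing the tile's rays

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done
```

    With uneven per-tile costs, the pull model keeps every worker busy until the queue drains, which is why it outperforms a static split when some regions of the image are much more expensive to trace than others.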

  16. Assignment Of Finite Elements To Parallel Processors

    NASA Technical Reports Server (NTRS)

    Salama, Moktar A.; Flower, Jon W.; Otto, Steve W.

    1990-01-01

    Elements are assigned approximately optimally to subdomains. A mapping algorithm based on the simulated-annealing concept is used to minimize the approximate time required to perform a finite-element computation on a hypercube computer or other network of parallel data processors. The mapping algorithm is needed when the shape of the domain is complicated or when it is otherwise not obvious what allocation of elements to subdomains minimizes the cost of computation.
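
    The simulated-annealing mapping can be sketched as follows: repeatedly propose moving one element to another processor, always accept improvements, and accept worsening moves with a probability that decays as the temperature drops. This toy version minimizes only the busiest processor's load; the article's cost additionally models communication on the hypercube:

```python
import math
import random

def anneal_mapping(costs, n_procs, steps=20000, t0=1.0, seed=1):
    """Simulated-annealing assignment of elements to processors,
    minimizing the busiest processor's load (a toy stand-in for the
    article's compute-plus-communication cost)."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_procs) for _ in costs]
    load = [0.0] * n_procs
    for e, p in enumerate(assign):
        load[p] += costs[e]
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9   # cooling schedule
        e = rng.randrange(len(costs))
        p_old, p_new = assign[e], rng.randrange(n_procs)
        if p_new == p_old:
            continue
        before = max(load)
        load[p_old] -= costs[e]
        load[p_new] += costs[e]
        worse = max(load) - before
        if worse <= 0 or rng.random() < math.exp(-worse / t):
            assign[e] = p_new          # accept the move
        else:
            load[p_old] += costs[e]    # reject: undo
            load[p_new] -= costs[e]
    return assign
```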

  17. SLAPP: A systolic linear algebra parallel processor

    SciTech Connect

    Drake, B.L.; Luk, F.T.; Speiser, J.M.; Symanski, J.J.

    1987-07-01

    Systolic array computer architectures provide a means for fast computation of the linear algebra algorithms that form the building blocks of many signal-processing algorithms, facilitating their real-time computation. For applications to signal processing, the systolic array operates on matrices, an inherently parallel view of the data, using numerical linear algebra algorithms that have been suitably parallelized to efficiently utilize the available hardware. This article describes work currently underway at the Naval Ocean Systems Center, San Diego, California, to build a two-dimensional systolic array, SLAPP, demonstrating efficient and modular parallelization of key matrix computations for real-time signal- and image-processing problems.

  18. Grundy: Parallel Processor Architecture Makes Programming Easy

    NASA Astrophysics Data System (ADS)

    Meier, Robert J.

    1985-12-01

    Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space and the concept of machine state is replaced by an invocation frame, a data structure of local variables, program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases. An instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelinable. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon[1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. As stack pointers, accumulators, and index registers facilitate using high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip. (2M lambda[2]) A bootstrapper and Input/Output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100-chip computer that is expected to have over 500 processors, 500K memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.

  19. Two-dimensional mesh-connected parallel processor with complex processing elements

    NASA Astrophysics Data System (ADS)

    Chen, Chaoyang; Shen, Xubang; Wang, Zhong; Sang, Hongshi

    2001-09-01

    LS MPP is a massively parallel processor. It has fine-grained parallelism with up to 4096 processing elements arranged in a SIMD architecture. The processing elements are arranged in a 64 x 64 two-dimensional mesh-connected array for low-level image processing. In this paper, the system architecture, the components of the processing element, the array controller, and the memory organization of the LS MPP processor are described. Finally, we discuss the performance of LS MPP.

  20. Scalable Unix tools on parallel processors

    SciTech Connect

    Gropp, W.; Lusk, E.

    1994-12-31

    The introduction of parallel processors that run a separate copy of Unix on each processor has introduced new problems in managing the user's environment. This paper discusses some generalizations of common Unix commands for managing files (e.g., ls) and processes (e.g., ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.
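
    The pattern behind such tools is to launch the per-node command on every node concurrently and merge the results, so wall-clock time stays roughly flat as nodes are added. A sketch using a thread pool — `echo` stands in for the remote-shell invocation, and the function names are mine, not the paper's tools:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def parallel_run(hosts, argv_for_host):
    """Scalable-tool pattern: run one command per node concurrently and
    merge the outputs into one report, keyed by host."""
    def run(host):
        out = subprocess.run(argv_for_host(host), capture_output=True,
                             text=True, check=True).stdout
        return host, out.strip()
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(run, hosts))
```

    On a real system, `argv_for_host` might wrap the per-node command in a remote shell, e.g. `lambda h: ["rsh", h, "ps"]`.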

  1. APEmille: a parallel processor in the teraflop range

    NASA Astrophysics Data System (ADS)

    Panizzi, E.

    1997-02-01

    APEmille is a SIMD parallel processor under development at the Italian National Institute for Nuclear Physics (INFN). It is the third machine of the APE family, following Ape and Ape100 and delivering peak performance in the Tflops range. APEmille is very well suited for Lattice QCD applications, both for its hardware characteristics and for its software and language features. APEmille is an array of custom arithmetic processors arranged in a three-dimensional torus. The replicated processor is a pipelined VLIW device performing integer and single/double precision IEEE floating point operations. The processor is optimized for complex computations and has a peak performance of 528 Mflops at 66 MHz. Each replica has 8 Mbytes of locally addressable RAM. In principle an array of 2048 nodes is able to break the Tflops barrier. Two other custom processors are used for program flow control, global addressing and inter-node communications. Fast nearest-neighbour communications as well as longer-distance communications and data broadcast are available. APEmille is interfaced to the external world by a PCI interface and a HIPPI channel. A network of PCs acts as the host computer. The APE operating system and the cross compiler run on it. A powerful programming language named TAO is provided and is highly optimized for QCD. A C++ compiler is foreseen. The TAO language is as simple as Fortran but as powerful as object-oriented languages. Specific data structures, operators and even statements can be defined by the user for each different application. Effort has been made to define the language constructs for QCD.

  2. Trajectory optimization on a parallel processor

    NASA Astrophysics Data System (ADS)

    Betts, John T.; Huffman, William P.

    Sparse finite differencing has been applied to a multiple shooting formulation of the two-point boundary value problem in a manner which is suitable for implementation on a parallel processor. Results are presented for a series of exoatmospheric trajectory optimization problems consisting of a number of burn and coast arcs. In the present method, finite burns are represented by constant thrust and weight flow. Examples considered include the maximum payload transfer to a specified mission orbit using two and three burns, with the pointing chosen to be inertially fixed on each arc, and an optimal control problem requiring definition of the optimal steering angles for a low-thrust trajectory between circular orbits with plane change.

  3. Global Arrays Parallel Programming Toolkit

    SciTech Connect

    Nieplocha, Jaroslaw; Krishnan, Manoj Kumar; Palmer, Bruce J.; Tipparaju, Vinod; Harrison, Robert J.; Chavarría-Miranda, Daniel

    2011-01-01

    The two predominant classes of programming models for parallel computing are distributed memory and shared memory. Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine grain load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message-passing. However, this performance comes at the cost of compromising the ease of use that the shared memory model advertises. Distributed memory models, such as message-passing or one-sided communication, offer performance and scalability but they are difficult to program. The Global Arrays toolkit attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC, Titanium, and, to a lesser extent, Co-Array Fortran. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF, ZPL, and Data Parallel C. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving
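
    The put/get style of the GA model — explicit transfers between a global index space and local storage, with ownership visible to the programmer — can be caricatured in a few lines. This is only the shape of the model, not the Global Arrays API:

```python
class GlobalArray:
    """Toy model of the GA idea: a logically shared 1-D array whose
    blocks live on different 'processes'; put/get move data between the
    global index space and per-process local storage."""
    def __init__(self, n, n_procs):
        self.block = (n + n_procs - 1) // n_procs
        self.local = [[0.0] * self.block for _ in range(n_procs)]

    def _locate(self, i):
        """Which process owns global index i, and at what local offset."""
        return i // self.block, i % self.block

    def put(self, i, value):
        p, off = self._locate(i)
        self.local[p][off] = value

    def get(self, i):
        p, off = self._locate(i)
        return self.local[p][off]
```

    The real toolkit operates on multidimensional blocks rather than single elements and adds one-sided accumulate and locality-query operations, but the key property is already visible here: the programmer can ask where data lives and arrange computation accordingly.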

  4. Acceleration of computer-generated hologram by Greatly Reduced Array of Processor Element with Data Reduction

    NASA Astrophysics Data System (ADS)

    Sugiyama, Atsushi; Masuda, Nobuyuki; Oikawa, Minoru; Okada, Naohisa; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

    2014-11-01

    We have implemented a computer-generated hologram (CGH) calculation on Greatly Reduced Array of Processor Element with Data Reduction (GRAPE-DR) processors. The cost of CGH calculation is enormous, but CGH calculation is well suited to parallel computation. The GRAPE-DR is a multicore processor that has 512 processor elements. The GRAPE-DR supports a double-precision floating-point operation and can perform CGH calculation with high accuracy. The calculation speed of the GRAPE-DR system is seven times faster than that of a personal computer with an Intel Core i7-950 processor.
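
    The enormous but parallel-friendly cost comes from the point-source CGH sum, in which every object point adds a zone-plate fringe at every hologram pixel. A sketch under the commonly used Fresnel approximation — the grid size, pixel pitch, and wavelength are arbitrary, and this is not the GRAPE-DR kernel itself:

```python
import numpy as np

def cgh(points, nx=64, ny=64, pitch=10e-6, wavelength=633e-9):
    """Point-cloud CGH: each object point (x, y, z, amplitude)
    contributes a zone-plate pattern cos(pi/(lambda*z) * r^2) across
    the hologram plane."""
    xs = (np.arange(nx) - nx / 2) * pitch
    ys = (np.arange(ny) - ny / 2) * pitch
    X, Y = np.meshgrid(xs, ys)
    field = np.zeros((ny, nx))
    for (px, py, pz, amp) in points:   # O(points * pixels): the costly part
        r2 = (X - px) ** 2 + (Y - py) ** 2
        field += amp * np.cos(np.pi / (wavelength * pz) * r2)
    return field
```

    The work is a large number of independent cosine evaluations — n_points times n_pixels — which is what maps cleanly onto the 512 processor elements.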

  5. Scan line graphics generation on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1988-01-01

    Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. Performing pixel-value calculations, facilitating load balancing across the processors, and applying the results to the Z buffer efficiently in parallel require special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

  6. Breadboard Signal Processor for Arraying DSN Antennas

    NASA Technical Reports Server (NTRS)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael; Rayhrer, Benno

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a later version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.
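
    The bracketed definition of "FX" can be made concrete in a few lines: Fourier-transform each antenna's samples into frequency channels, then multiply one spectrum by the conjugate of the other and accumulate over segments. A toy single-baseline sketch — the channel count and segmenting are illustrative, not the breadboard's parameters:

```python
import numpy as np

def fx_correlate(sig_a, sig_b, nchan=8):
    """FX correlation for one baseline: FFT each segment into channels
    (F), then cross-multiply and accumulate the spectra (X)."""
    n_seg = len(sig_a) // nchan
    acc = np.zeros(nchan, dtype=complex)
    for s in range(n_seg):
        fa = np.fft.fft(sig_a[s * nchan:(s + 1) * nchan])
        fb = np.fft.fft(sig_b[s * nchan:(s + 1) * nchan])
        acc += fa * np.conj(fb)       # cross-power spectrum, accumulated
    return acc / n_seg
```

    The phase of the accumulated cross-power in each channel is what yields the delay and phase calibration used to align the antenna signals before combining.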

  7. A systolic array parallelizing compiler

    SciTech Connect

    Tseng, P.S.

    1990-01-01

    This book presents a completely new approach to the problem of systolic array parallelizing compilation. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler which can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strengths of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code is given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

  8. MILP model for resource disruption in parallel processor system

    NASA Astrophysics Data System (ADS)

    Nordin, Syarifah Zyurina; Caccetta, Louis

    2015-02-01

    In this paper, we consider the existence of disruption in an unrelated parallel processor scheduling system. The disruption occurs due to a resource shortage in which one of the parallel processors suffers a breakdown during task allocation, which affects the initial scheduling plan. Our objective is to reschedule the original unrelated parallel processor schedule after the resource disruption so as to minimize the makespan. A mixed integer linear programming model is presented for the recovery scheduling that considers the post-disruption policy. We conduct a computational experiment with different stopping time limits to assess the performance of the model, using the CPLEX 12.1 solver in AIMMS 3.10 software.
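
    As a rough stand-in for the MILP, the recovery step can be sketched with a greedy earliest-finish heuristic: drop the failed processor and reassign its tasks, longest first. This is an illustrative simplification of mine, not the paper's model, which solves the assignment exactly with CPLEX:

```python
def reschedule(tasks, speeds, down_proc):
    """After a processor breaks down, reassign tasks among the remaining
    unrelated processors; speeds[p][t] is the processing time of task t
    on processor p. Greedy heuristic: longest tasks first, each to the
    processor where it finishes earliest."""
    procs = [p for p in range(len(speeds)) if p != down_proc]
    finish = {p: 0.0 for p in procs}
    assign = {}
    for t in sorted(tasks, key=lambda t: -min(speeds[p][t] for p in procs)):
        p = min(procs, key=lambda p: finish[p] + speeds[p][t])
        finish[p] += speeds[p][t]
        assign[t] = p
    return assign, max(finish.values())
```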

  9. High density packaging and interconnect of massively parallel image processors

    NASA Technical Reports Server (NTRS)

    Carson, John C.; Indin, Ronald J.

    1991-01-01

    This paper presents conceptual designs for high density packaging of parallel processing systems. The systems fall into two categories: global memory systems where many processors are packaged into a stack, and distributed memory systems where a single processor and many memory chips are packaged into a stack. Thermal behavior and performance are discussed.

  10. Chemical network problems solved on NASA/Goddard's massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Cho, Seog Y.; Carmichael, Gregory R.

    1987-01-01

    The single instruction stream, multiple data stream Massively Parallel Processor (MPP) unit consists of 16,384 bit-serial arithmetic processors configured as a 128 x 128 array whose speed can exceed that of current supercomputers (Cyber 205). The applicability of the MPP for solving reaction network problems is presented and discussed, including the mapping of the calculation to the architecture, and CPU timing comparisons.

  11. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J.

    1990-01-01

An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby concurrently transmit pixel data to pixel locations in the frame buffer.

  12. Overtaking Vehicle Detection Method and Its Implementation Using IMAPCAR Highly Parallel Image Processor

    NASA Astrophysics Data System (ADS)

    Sakurai, Kazuyuki; Kyo, Shorin; Okazaki, Shin'ichiro

    This paper describes the real-time implementation of a vision-based overtaking vehicle detection method for driver assistance systems using IMAPCAR, a highly parallel SIMD linear array processor. The implemented overtaking vehicle detection method is based on optical flows detected by block matching using SAD and detection of the flows' vanishing point. The implementation is done efficiently by taking advantage of the parallel SIMD architecture of IMAPCAR. As a result, video-rate (33 frames/s) implementation could be achieved.
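The flow-detection step the abstract describes, block matching by sum of absolute differences, can be sketched in scalar NumPy; the block and search-window sizes here are illustrative, and a real IMAPCAR implementation would map the SAD loops onto the SIMD lanes rather than Python loops.

```python
import numpy as np

def block_match(prev, curr, block=8, search=4):
    """Per-block optical flow by exhaustive SAD search.

    For each block x block tile of `prev`, find the (dy, dx) offset within
    +/- search whose tile in `curr` minimizes the sum of absolute
    differences. Returns {(y, x): (dy, dx)}.
    """
    H, W = prev.shape
    flows = {}
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            ref = prev[y:y + block, x:x + block].astype(np.int64)
            best, best_sad = (0, 0), None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= H - block and 0 <= xx <= W - block:
                        cand = curr[yy:yy + block, xx:xx + block].astype(np.int64)
                        sad = int(np.abs(ref - cand).sum())
                        if best_sad is None or sad < best_sad:
                            best_sad, best = sad, (dy, dx)
            flows[(y, x)] = best
    return flows
```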

  13. Massively Parallel MRI Detector Arrays

    PubMed Central

    Keil, Boris; Wald, Lawrence L

    2013-01-01

Originally proposed as a method to increase sensitivity by extending the locally high sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts, drawing on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review methods for optimally combining array data and the changes in RF methodology needed to construct massively parallel MRI detector arrays, and show examples of the state of the art in highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758
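As a sketch of the array-combination methods the review covers, the snippet below implements the standard root-sum-of-squares combination and a matched-filter combination given coil sensitivity maps; it is a generic illustration (assuming white, uncorrelated noise), not code from the article.

```python
import numpy as np

def rss_combine(coil_images):
    """Root-sum-of-squares combination of per-coil complex images."""
    return np.sqrt((np.abs(coil_images) ** 2).sum(axis=0))

def matched_filter_combine(coil_images, sens):
    """SNR-optimal combination given sensitivity maps `sens`
    (one complex map per coil; noise assumed white and uncorrelated)."""
    num = (np.conj(sens) * coil_images).sum(axis=0)
    den = (np.abs(sens) ** 2).sum(axis=0)
    return num / np.maximum(den, 1e-12)
```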

  14. Singular value decomposition utilizing parallel algorithms on graphical processors

    SciTech Connect

    Kotas, Charlotte W; Barhen, Jacob

    2011-01-01

One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_{k=1}^{K} X(k)X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining V, Σ, and U such that A = UΣV^H, where U and V are orthonormal and Σ is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^HA, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step algorithm which bidiagonalizes the matrix using Householder
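The relationship the abstract relies on, that the singular values of the (scaled) snapshot matrix are the square roots of the eigenvalues of the sample spectral matrix, can be checked numerically; this is a generic NumPy illustration, not the GPU implementation studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# K snapshots of an N-element array (complex), stacked as columns of X
N, K = 4, 100
X = rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K))

# Sample spectral matrix: Cx = (1/K) * sum_k X(k) X(k)^H
Cx = (X @ X.conj().T) / K

# Eigen-decomposition of Cx vs. SVD of the scaled data matrix directly
eigvals = np.sort(np.linalg.eigvalsh(Cx))[::-1]           # descending
s = np.linalg.svd(X / np.sqrt(K), compute_uv=False)       # descending

# Singular values of X/sqrt(K) are the square roots of the eigenvalues of Cx,
# so the SVD yields the whitening quantities without ever forming Cx.
assert np.allclose(s ** 2, eigvals)
```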

  15. Parallel processor simulator for multiple optic channel architectures

    NASA Astrophysics Data System (ADS)

    Wailes, Tom S.; Meyer, David G.

    1992-12-01

A parallel processing architecture based on multiple channel optical communication is described and compared with existing interconnection strategies for parallel computers. The proposed multiple channel architecture (MCA) uses MQW-DBR lasers to provide a large number of independent, selectable channels (or virtual buses) for data transport. Arbitrary interconnection patterns as well as machine partitions can be emulated via appropriate channel assignments. Hierarchies of parallel architectures and simultaneous execution of parallel tasks are also possible. Described are a basic overview of the proposed architecture, various channel allocation strategies that can be utilized by the MCA, and a summary of advantages of the MCA compared with traditional interconnection techniques. Also described is a comprehensive multiple processor simulator that has been developed to execute parallel algorithms using the MCA as a data transport mechanism between processors and memory units. Simulation results -- including average channel load, effective channel utilization, and average network latency for different algorithms and different transmission speeds -- are also presented.

  16. DFT algorithms for bit-serial GaAs array processor architectures

    NASA Technical Reports Server (NTRS)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  17. Global synchronization of parallel processors using clock pulse width modulation

    DOEpatents

    Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

    2013-04-02

A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and perform a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse-width-modified clock signal to a plurality of processors in the parallel computing system.

  18. Staging memory for massively parallel processor

    NASA Technical Reports Server (NTRS)

    Batcher, Kenneth E. (Inventor)

    1988-01-01

    The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

  19. Dynamic overset grid communication on distributed memory parallel processors

    NASA Technical Reports Server (NTRS)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  20. An Evaluation of Document Retrieval from Serial Files Using the ICL Distributed Array Processor.

    ERIC Educational Resources Information Center

    Pogue, Christine; Willett, Peter

    1984-01-01

    Describes preliminary investigation of the use of International Computers Limited's Distributed Array Processor (DAP) for parallel searching of large serial files of documents. DAP hardware and software, test collections, measurement of DAP performance, search algorithms, experimental results, and DAP suitability for interactive searching are…

  1. Mapping Radiosity Computations to Parallel Processors.

    NASA Astrophysics Data System (ADS)

    Singh, Gautam Bir

The radiosity method for rendering scenes is gaining popularity because of its ability to accurately model the energy distribution in an environment. As this photonic energy distribution is independent of the viewer's position, generating scenes for different viewpoints only requires hidden surface removal and can be performed in real-time. This makes it more attractive than ray tracing as a technique for modeling illumination. It is quite conceivable that the radiosity method will be used for applications in scientific visualization, lighting simulations, CAD/CAM, virtual reality, and medical imaging. Computing the radiosity of a scene with moderate to high complexity is tantamount to solving a system of tens of thousands of linear equations. Iterative linear system solvers, such as Gauss-Seidel, Jacobi, or conjugate descent, are quite demanding for a system of equations this large. An alternate approach, known as progressive refinement, offers some computational tractability and delivers an approximate solution relatively quickly. This dissertation presents the results of partitioning the radiosity computation to suitably map on a variety of multiprocessor classes. The effect of problem decomposition on computation and communication components is studied for the shared memory, the message passing and the loosely coupled distributed memory multiprocessors. Kendall Square Research's KSR1 and Intel hypercube iPSC/860 were used for experimenting with the shared memory and message-passing algorithms respectively. A network of IBM RS/6000 was used for understanding coarse grain parallelization techniques. These experiments demonstrated that the optimality of parallel algorithms must be considered as a property of the <machine, algorithm> pair. Thus the notion of program portability must also take machine architecture into consideration besides allowing for software compatibility.
As the number of polygons for processing complex scenes continues to grow, the subdivision in the object space become

  2. Real-time trajectory optimization on parallel processors

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.

    1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems: the Goddard problem; the acceleration-limited, planar minimum-time-to-the-origin problem; and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.

  3. Automatic generation of synchronization instructions for parallel processors

    SciTech Connect

    Midkiff, S.P.

    1986-05-01

The development of high speed parallel multi-processors, capable of parallel execution of doacross and forall loops, has stimulated the development of compilers to transform serial FORTRAN programs to parallel forms. One of the duties of such a compiler must be to place synchronization instructions in the parallel version of the program to ensure the legal execution order of doacross and forall loops. This thesis gives strategies usable by a compiler to generate these synchronization instructions. It presents algorithms for reducing the parallelism in FORTRAN programs to match a target architecture, recovering some of the parallelism so discarded, and reducing the number of synchronization instructions that must be added to a FORTRAN program, as well as basic strategies for placing synchronization instructions. These algorithms are developed for two synchronization instruction sets. 20 refs., 56 figs.
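A post/wait synchronization pair of the kind such a compiler inserts for a doacross loop can be sketched with threading events; the cyclic scheduling and dependence-distance parameter below are illustrative, not the thesis's instruction sets.

```python
import threading

def doacross(n_iters, n_procs, body, dep_dist=1):
    """Run a doacross loop: iteration i may execute only after iteration
    i - dep_dist has signalled completion. The wait()/set() calls play the
    role of the compiler-inserted wait/post synchronization instructions."""
    done = [threading.Event() for _ in range(n_iters)]

    def worker(p):
        for i in range(p, n_iters, n_procs):    # cyclic iteration scheduling
            if i >= dep_dist:
                done[i - dep_dist].wait()       # wait(i - d): loop-carried dep
            body(i)
            done[i].set()                       # post(i): release successor

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(n_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With dependence distance 1 the guarded body is fully serialized across processors, so the iterations observably complete in order even though three threads share the work.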

  4. Potential of minicomputer/array-processor system for nonlinear finite-element analysis

    NASA Technical Reports Server (NTRS)

    Strohkorb, G. A.; Noor, A. K.

    1983-01-01

    The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.

  5. Fabrication of fault-tolerant systolic array processors

    SciTech Connect

    Golovko, V.A.

    1995-05-01

    Methods for designing fault-tolerant systolic array processors are discussed. Several ways of bypassing faulty elements in configurations, which depend on an input-data flow organization, are suggested. An analysis of the additional hardware costs of providing fault tolerance by various techniques and for various levels of redundancy is presented. Hadamard fault-tolerant processor design was used to illustrate the efficiency of the techniques suggested.

  6. A Josephson systolic array processor for multiplication/addition operations

    SciTech Connect

    Morisue, M.; Li, F.Q.; Tobita, M.; Kaneko, S. )

    1991-03-01

A novel Josephson systolic array processor to perform multiplication/addition operations is proposed. The systolic array processor proposed here consists of a set of three kinds of interconnected cells whose main circuits are built from SQUID gates. A multiplication of 2 bits by 2 bits is performed in a single cell at a time, and an addition of three two-bit data is simultaneously performed in another type of cell. Furthermore, information in this system flows between cells in a pipeline fashion so that high performance can be achieved. In this paper the principle of the Josephson systolic array processor is described in detail, and simulation results are illustrated for the multiplication/addition of (4 bits × 4 bits + 8 bits). The results show that these operations can be executed in 330 ps.
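The pipelined multiply/accumulate flow of a systolic array can be illustrated with a cycle-by-cycle software simulation; the cells here work on plain integers rather than the paper's 2-bit Josephson SQUID-gate cells, so this is a generic sketch of the dataflow, not the proposed circuit.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    Cell (i, j) multiplies the operands that pass through it each cycle and
    accumulates the partial result in place; operands of A stream in from
    the left and operands of B from the top, skewed so that the k-th pair
    reaches cell (i, j) at cycle t = i + j + k.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):             # total pipeline cycles
        for i in range(n):
            for j in range(n):
                k = t - i - j              # operand pair reaching (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```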

  7. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    DOEpatents

    Chatterjee, Siddhartha; Gunnels, John A.

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
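An element-wise toy analogue of the skewed distribution can be written as an owner-computes function; the mesh shape and skew rule below are illustrative, not the patent's exact block-cyclic formula.

```python
def owner(i, j, P, Q, skew=1):
    """Processor owning element (i, j) on a P x Q mesh under a skewed cyclic
    distribution: plain cyclic in rows, with the column mapping shifted by
    `skew` per row so that both a row and a column of the array spread over
    many distinct processors (the property claimed in the patent)."""
    return (i % P, (j + skew * i) % Q)
```

Without the skew, a whole array column lands on a single mesh column; with it, consecutive elements of a column visit distinct processors.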

  8. Ring-array processor distribution topology for optical interconnects

    NASA Technical Reports Server (NTRS)

    Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

    1992-01-01

    The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

  9. Feasibility of optically interconnected parallel processors using wavelength division multiplexing

    SciTech Connect

    Deri, R.J.; De Groot, A.J.; Haigh, R.E.

    1996-03-01

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to approximately 150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.

  10. Analog parallel processor hardware for high speed pattern recognition

    NASA Technical Reports Server (NTRS)

    Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

    1990-01-01

A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 microsec. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.
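The inner-product-plus-WTA recall scheme can be sketched digitally, with an argmax standing in for the analog winner-take-all circuit; the bipolar patterns below are illustrative, not the chip's stored library.

```python
import numpy as np

def wta_match(memory, x):
    """Inner-product associative recall with winner-take-all selection:
    form the inner product of the input with every stored pattern (rows of
    `memory`) and return the index of the largest, i.e. the closest match."""
    return int(np.argmax(memory @ x))
```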

  11. Parallel information transfer in a multinode quantum information processor.

    PubMed

    Borneman, T W; Granade, C E; Cory, D G

    2012-04-01

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel.

  12. Optimal mapping of irregular finite element domains to parallel processors

    NASA Technical Reports Server (NTRS)

    Flower, J.; Otto, S.; Salama, M.

    1987-01-01

    Mapping the solution domain of n-finite elements into N-subdomains that may be processed in parallel by N-processors is an optimal one if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
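A stripped-down version of the simulated annealing mapping can be sketched as follows; the cost function (load imbalance plus cut edges, standing in for communication) and the cooling schedule are simplifications chosen for illustration, not the paper's formulation.

```python
import math
import random

def cost(part, adj, n_procs):
    """Load imbalance plus number of cut (inter-processor) edges."""
    loads = [0] * n_procs
    for q in part:
        loads[q] += 1
    cut = sum(1 for a, b in adj if part[a] != part[b])
    return (max(loads) - min(loads)) + cut

def anneal_mapping(adj, n_elems, n_procs, T0=5.0, cooling=0.995, steps=4000, seed=0):
    """Simulated-annealing mapping of elements onto processors: perturb one
    element's assignment, then accept or reject with the Metropolis rule."""
    rng = random.Random(seed)
    part = [rng.randrange(n_procs) for _ in range(n_elems)]
    c, T = cost(part, adj, n_procs), T0
    for _ in range(steps):
        e = rng.randrange(n_elems)
        old = part[e]
        part[e] = rng.randrange(n_procs)
        c2 = cost(part, adj, n_procs)
        # Always accept downhill moves; accept uphill with prob exp(-dE/T).
        if c2 > c and rng.random() >= math.exp((c - c2) / T):
            part[e] = old          # reject: restore previous assignment
        else:
            c = c2                 # accept
        T *= cooling
    return part, c
```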

  13. Guidelines for efficient use of optical systolic array processors

    SciTech Connect

    Casasent, D.

    1983-01-01

    The design, error analysis, component accuracy required, computational capacity, data flow and pipelining, plus the algorithm and application all seriously impact the use of optical systolic array processors. The author provides initial remarks, results, examples and solutions for each of these issues. 20 references.

  14. The language parallel Pascal and other aspects of the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  15. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1990-01-01

    Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.

  16. Frequency-multiplexed and pipelined iterative optical systolic array processors

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Jackson, J.; Neuman, C.

    1983-01-01

    Optical matrix processors using acoustooptic transducers are described, with emphasis on new systolic array architectures using frequency multiplexing in addition to space and time multiplexing. A Kalman filtering application is considered in a case study from which the operations required on such a system can be defined. This also serves as a new and powerful application for iterative optical processors. The importance of pipelining the data flow and the ordering of the operations performed in a specific application of such a system are also noted. Several examples of how to effectively achieve this are included. A new technique for handling bipolar data on such architectures is also described.

  17. Increasing the Power of a University Computing System with Attached Array Processors.

    ERIC Educational Resources Information Center

    Grimison, Alec

    1982-01-01

    Array processors are emerging as one cost-effective way of increasing the computing power of existing university computer systems. Two array processor installations at Cornell University and implications for other colleges and universities are discussed. (Author/JN)

  18. Particle simulation of plasmas on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Gledhill, I. M. A.; Storey, L. R. O.

    1987-01-01

    Particle simulations, in which collective phenomena in plasmas are studied by following the self consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.

  19. An optical inner-product array processor for associative retrieval

    NASA Astrophysics Data System (ADS)

    Kung, S. Y.; Liu, H. K.

    1986-06-01

In this paper, an inner-product array processor for the associative retrieval problem is presented. First, the algorithm and architecture of the array processor design are discussed. Then an optical implementation scheme is proposed. The matrix model of the associative memory is adopted. In this model, if one of the M vectors is to be reliably recalled, the dimension of the vectors, N, must be much larger than M. By taking advantage of this fact, our result offers a saving of a factor of M/N in the number of matrix elements. More significantly, real-time inputting and updating of the matrix elements can be potentially implemented with existing space-variant holographic elements and recently discovered liquid crystal television spatial light modulators.

  20. Implementation of SAR interferometric map generation using parallel processors

    SciTech Connect

    Doren, N.; Wahl, D.E.

    1998-07-01

    Interferometric fringe maps are generated by accurately registering a pair of complex SAR images of the same scene imaged from two very similar geometries, and calculating the phase difference between the two images by averaging over a neighborhood of pixels at each spatial location. The phase difference (fringe) map resulting from this IFSAR operation is then unwrapped and used to calculate the height estimate of the imaged terrain. Although the method used to calculate interferometric fringe maps is well known, it is generally executed in a post-processing mode well after the image pairs have been collected. In that mode of operation, there is little concern about algorithm speed and the method is normally implemented on a single processor machine. This paper describes how the interferometric map generation is implemented on a distributed-memory parallel processing machine. This particular implementation is designed to operate on a 16 node Power-PC platform and to generate interferometric maps in near real-time. The implementation is able to accommodate large translational offsets, along with a slight amount of rotation which may exist between the interferometric pair of images. If the number of pixels in the IFSAR image is large enough, the implementation accomplishes nearly linear speed-up times with the addition of processors.
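The fringe-map computation described above (conjugate product of the registered pair, averaged over a pixel neighborhood, then the phase taken) can be sketched as follows. This is an illustrative implementation, assuming non-overlapping box averaging; the function name and box size are ours.

```python
import numpy as np

def fringe_map(img1, img2, box=4):
    """Phase-difference (fringe) map from a registered complex image pair:
    average the conjugate product over box x box neighborhoods, then
    take the phase of each average."""
    prod = img1 * np.conj(img2)                  # per-pixel phase difference carrier
    h, w = prod.shape
    hb, wb = h // box, w // box
    avg = prod[:hb * box, :wb * box].reshape(hb, box, wb, box).mean(axis=(1, 3))
    return np.angle(avg)
```

Averaging the complex product before taking the phase (rather than averaging phases) is what suppresses noise without wrapping artifacts.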

  1. On program restructuring, scheduling, and communication for parallel processor systems

    SciTech Connect

    Polychronopoulos, Constantine D.

    1986-08-01

This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler, was used to transform programs into parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.
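Loop coalescing, one of the restructuring techniques named above, flattens a nested loop into a single loop so that iterations can be handed out to processors from one index. A minimal sketch of the idea (the helper and its names are ours, not the dissertation's):

```python
def coalesce(N, M, body):
    """Loop coalescing: run an N x M doubly nested loop as one loop of
    N*M iterations, recovering the original (i, j) from the flat index.
    A single flat index is easier to self-schedule across processors."""
    results = []
    for t in range(N * M):
        i, j = divmod(t, M)      # row-major recovery of the nest indices
        results.append(body(i, j))
    return results
```

The flat loop visits iterations in the same row-major order as the original nest, so the transformation preserves semantics for independent iterations.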

  2. An informal introduction to program transformation and parallel processors

    SciTech Connect

    Hopkins, K.W.

    1994-08-01

In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, since my previous use of computers was as a classroom demonstration tool.

  3. The performance realities of massively parallel processors: A case study

    SciTech Connect

    Lubeck, O.M.; Simmons, M.L.; Wasserman, H.J.

    1992-07-01

    This paper presents the results of an architectural comparison of SIMD massive parallelism, as implemented in the Thinking Machines Corp. CM-2 computer, and vector or concurrent-vector processing, as implemented in the Cray Research Inc. Y-MP/8. The comparison is based primarily upon three application codes that represent Los Alamos production computing. Tests were run by porting optimized CM Fortran codes to the Y-MP, so that the same level of optimization was obtained on both machines. The results for fully-configured systems, using measured data rather than scaled data from smaller configurations, show that the Y-MP/8 is faster than the 64k CM-2 for all three codes. A simple model that accounts for the relative characteristic computational speeds of the two machines, and reduction in overall CM-2 performance due to communication or SIMD conditional execution, is included. The model predicts the performance of two codes well, but fails for the third code, because the proportion of communications in this code is very high. Other factors, such as memory bandwidth and compiler effects, are also discussed. Finally, the paper attempts to show the equivalence of the CM-2 and Y-MP programming models, and also comments on selected future massively parallel processor designs.

  4. Solution of large linear systems of equations on the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Ida, Nathan; Udawatta, Kapila

    1987-01-01

The Massively Parallel Processor (MPP) was designed as a special machine for specific applications in image processing. As a parallel machine with a large number of processors that can be reconfigured in different combinations, it is also applicable to other problems that require a large number of processors. The solution of linear systems of equations on the MPP is investigated. The solution times achieved are compared to those obtained with a serial machine and the performance of the MPP is discussed.

  5. Optimal evaluation of array expressions on massively parallel machines

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

    1992-01-01

    We investigate the problem of evaluating FORTRAN 90 style array expressions on massively parallel distributed-memory machines. On such machines, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression. The choice of where to perform the operation then affects this cost. We present algorithms based on dynamic programming to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider expressions containing operations that change the shape of the arrays, and show that our approach extends naturally to handle this case.
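The core cost trade-off described above (pay to realign operands versus evaluate at a chosen location) can be shown with a toy example. This is only an illustration of the alignment-cost objective, not the paper's dynamic program over grids, hypercubes, or fat-trees; the function, the 1-D "positions," and the distance-based cost are our simplifying assumptions.

```python
def cheapest_alignment(positions, move_cost=lambda a, b: abs(a - b)):
    """Toy alignment decision: operand arrays live at the given offsets;
    evaluating the elementwise operation at offset p costs the sum of
    moving every operand to p. Return the cheapest candidate offset."""
    return min(set(positions),
               key=lambda p: sum(move_cost(q, p) for q in positions))
```

In the real setting the candidate set and cost function depend on the interconnection topology, which is what makes dynamic programming worthwhile.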

  6. Massively parallel processor networks with optical express channels

    DOEpatents

    Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

    1999-08-24

    An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

  8. Partitioning: An essential step in mapping algorithms into systolic array processors

    SciTech Connect

    Navarro, J.J.; Llaberia, J.M.; Valero, M.

    1987-07-01

Many scientific and technical applications require high computing speed; those involving matrix computations are typical. For such applications, algorithmically specialized, high-performance, low-cost architectures have been conceived and implemented. Systolic array processors (SAPs) are a good example of these machines. An SAP is a regular array of simple processing elements (PEs) with a nearest-neighbor interconnection pattern. The simplicity, modularity, and expandability of SAPs make them suitable for VLSI/WSI implementation. Algorithms that are efficiently executed on SAPs are called systolic algorithms (SAs). An SA uses an array of systolic cells whose parallel operations must be specified. When an SA is executed on an SAP, the specified computations of each cell are carried out by a PE of the SAP.
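The per-cell computation of a classic systolic algorithm can be modeled in software. The sketch below simulates an output-stationary systolic matrix multiply: at each time step k one wavefront of operands reaches the grid and every PE (i, j) accumulates its partial product. It is a sequential model of the data flow, not a hardware description.

```python
def systolic_matmul(A, B):
    """Software model of an output-stationary systolic array:
    C[i][j] accumulates A[i][k] * B[k][j] as operand wavefronts
    (indexed by k) stream past the PE grid."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k in range(m):                     # one wavefront per time step
        for i in range(n):                 # in hardware, all PEs fire in parallel
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

The outer k loop is the time axis; the two inner loops are what the PE array performs concurrently in hardware.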

  9. On nonlinear finite element analysis in single-, multi- and parallel-processors

    NASA Technical Reports Server (NTRS)

    Utku, S.; Melosh, R.; Islam, M.; Salama, M.

    1982-01-01

    Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
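The observation that each Newton-Raphson step reduces to a linear solve can be made concrete with a short sketch. This is a generic Newton iteration for equilibrium f_int(u) = f_ext with a tangent-stiffness solve per step; the function and argument names are ours.

```python
import numpy as np

def newton_equilibrium(internal_force, tangent, f_ext, u0, tol=1e-10, maxit=50):
    """Newton-Raphson for nonlinear equilibrium f_int(u) = f_ext:
    every iteration solves one linear system K_t(u) du = r, which is
    why linear finite element machinery carries over to the nonlinear case."""
    u = np.asarray(u0, float).copy()
    for _ in range(maxit):
        r = f_ext - internal_force(u)        # residual (out-of-balance force)
        if np.linalg.norm(r) < tol:
            break
        u = u + np.linalg.solve(tangent(u), r)
    return u
```

With a hardening-type response such as f_int(u) = u + u**3 the iteration converges in a handful of steps.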

  10. Serial multiplier arrays for parallel computation

    NASA Technical Reports Server (NTRS)

    Winters, Kel

    1990-01-01

Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

  11. Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

    NASA Technical Reports Server (NTRS)

    Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

    1997-01-01

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons, each locally connected to its neighboring neurons and associated active-pixel sensor. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

  12. Periodic Application of Concurrent Error Detection in Processor Array Architectures. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Chen, Paul Peichuan

    1993-01-01

    Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

  13. Track recognition in 4 µs by a systolic trigger processor using a parallel Hough transform

    SciTech Connect

Klefenz, F.; Noffz, K.H.; Conen, W.; Zoz, R.; Kugel, A. (Univ. Heidelberg, Lehrstuhl fuer Informatik V); Maenner, R. (Univ. Heidelberg, Lehrstuhl fuer Informatik V; Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen)

    1993-08-01

A parallel Hough transform processor has been developed that identifies circular particle tracks in a 2D projection of the OPAL jet chamber. The high-speed requirements imposed by the 8 bunch crossing mode of LEP could be fulfilled by computing the starting angle and the radius of curvature for each well defined track in less than 4 µs. The system consists of a Hough transform processor that determines well defined tracks, and a Euler processor that counts their number by applying the Euler relation to the thresholded result of the Hough transform. A prototype of a systolic processor has been built that handles one sector of the jet chamber. It consists of 35 × 32 processing elements that were loaded into 21 programmable gate arrays (XILINX). This processor runs at a clock rate of 40 MHz. It has been tested offline with about 1,000 original OPAL events. No deviations from the off-line simulation have been found. A trigger efficiency of 93% has been obtained. The prototype together with the associated drift time measurement unit has been installed at the OPAL detector at LEP and 100k events have been sampled to evaluate the system under detector conditions.
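The Hough voting idea behind the trigger can be sketched in a reduced form. The paper's processor parametrizes tracks by starting angle and radius of curvature; the sketch below simplifies to voting for circle centres at a known radius, which shows the same accumulate-and-threshold structure (grid size, extent, and the fixed radius are our assumptions).

```python
import numpy as np

def hough_circle_centres(points, radius, grid=64, extent=10.0):
    """Hough vote for circle centres at a known radius: each hit point
    votes along its dual circle of candidate centres; well defined
    tracks appear as peaks in the accumulator."""
    acc = np.zeros((grid, grid), dtype=int)
    thetas = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
    for x, y in points:
        cx = x - radius * np.cos(thetas)         # candidate centres for this hit
        cy = y - radius * np.sin(thetas)
        ix = np.round((cx + extent) * (grid - 1) / (2 * extent)).astype(int)
        iy = np.round((cy + extent) * (grid - 1) / (2 * extent)).astype(int)
        ok = (ix >= 0) & (ix < grid) & (iy >= 0) & (iy < grid)
        np.add.at(acc, (ix[ok], iy[ok]), 1)      # accumulate votes
    return acc
```

In the systolic implementation each accumulator cell corresponds to a processing element, so all votes for one hit are cast in parallel.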

  14. An iterative expanding and shrinking process for processor allocation in mixed-parallel workflow scheduling.

    PubMed

    Huang, Kuo-Chan; Wu, Wei-Ya; Wang, Feng-Jian; Liu, Hsiao-Ching; Hung, Chun-Hao

    2016-01-01

Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelism, i.e. mixed-parallel workflows, to solve large computational problems can achieve better efficiency than either pure task parallelism or pure data parallelism. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with an additional challenging issue of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of moldable tasks, called M-tasks, and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocations. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes of the same layer in a workflow might have unequal workloads. PMID:27504236
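The expand/shrink idea can be caricatured in a few lines. This is a much-reduced sketch, not the IAES algorithm: it assumes independent moldable tasks with ideal linear speedup, repeatedly expands the task that dominates the makespan, and shrinks the least-loaded task when no spare processors remain.

```python
def expand_shrink(work, total_procs, rounds=10):
    """Toy expand/shrink allocation for independent moldable tasks,
    assuming runtime = work / processors (ideal linear speedup)."""
    alloc = {t: 1 for t in work}
    runtime = lambda t: work[t] / alloc[t]
    spare = total_procs - len(work)
    for _ in range(rounds):
        crit = max(work, key=runtime)            # task setting the makespan
        if spare > 0:
            alloc[crit] += 1; spare -= 1         # expand the critical task
        else:
            donor = min(work, key=runtime)
            if donor != crit and alloc[donor] > 1:
                alloc[donor] -= 1                # shrink a non-critical task
                alloc[crit] += 1
    return alloc
```

The real IAES works on critical paths of a workflow DAG with measured (non-ideal) speedup profiles, but the give-and-take between tasks is the same mechanism.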

  16. High speed vision processor with reconfigurable processing element array based on full-custom distributed memory

    NASA Astrophysics Data System (ADS)

    Chen, Zhe; Yang, Jie; Shi, Cong; Qin, Qi; Liu, Liyuan; Wu, Nanjian

    2016-04-01

In this paper, a hybrid vision processor based on a compact full-custom distributed memory for near-sensor high-speed image processing is proposed. The proposed processor consists of a reconfigurable processing element (PE) array, a row processor (RP) array, and a dual-core microprocessor. The PE array includes two-dimensional processing elements with a compact full-custom distributed memory. It supports real-time reconfiguration between the PE array and the self-organized map (SOM) neural network. The vision processor is fabricated using a 0.18 µm CMOS technology. The circuit area of the distributed memory is reduced markedly to one-third of that of the conventional memory, so that the circuit area of the vision processor is reduced by 44.2%. Experimental results demonstrate that the proposed design functions correctly.

  17. Using algebra for massively parallel processor design and utilization

    NASA Technical Reports Server (NTRS)

    Campbell, Lowell; Fellows, Michael R.

    1990-01-01

This paper summarizes the authors' advances in the design of dense processor networks. It reports a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

  18. A garbage collection algorithm for shared memory parallel processors

    SciTech Connect

Crammond, J.

    1988-12-01

    This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.

  19. Parallel transport gates in a mixed-species ion trap processor

    NASA Astrophysics Data System (ADS)

    Home, Jonathan

Scaled-up quantum information processors will require large numbers of parallel gate operations. For ion trap quantum processing, a promising approach is to perform these operations in separated regions of a multi-zone processing chip between which quantum information is transported either by distributed photonic entanglement or by deterministic shuttling of the ions through the array. However, scaling the technology for controlling pulsed laser beams which address each of multiple regions appears challenging. I will describe recent work on the control of both beryllium and calcium ions by transporting ions through static laser beams. We have demonstrated both parallel individually addressed operations as well as sequences of operations. Work is in progress towards multi-qubit gates, which requires good control of the ion transport velocity. We have developed a number of techniques for measuring and optimizing velocities in our trap, enabling significant improvements in performance. In addition to direct results, I will give an overview of our multi-species apparatus, including recent results on high fidelity multi-qubit gates. We are grateful for funding from the Swiss National Science Foundation and the ETH Zurich.

  20. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted

    1989-01-01

A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.
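The explicit time integration named above avoids linear solves entirely, which is what makes it attractive for parallel machines. A generic sketch in velocity-Verlet form of central-difference stepping (not WHAMS itself; the lumped mass and function names are our assumptions):

```python
import numpy as np

def central_difference(m, f_int, f_ext, u0, v0, dt, steps):
    """Explicit central-difference stepping with lumped mass m:
    a = (f_ext(t) - f_int(u)) / m, so each step is a cheap update
    with no linear system to solve."""
    u = np.asarray(u0, float).copy()
    v = np.asarray(v0, float).copy()
    a = (f_ext(0.0) - f_int(u)) / m
    for n in range(steps):
        v = v + 0.5 * dt * a            # half-step velocity
        u = u + dt * v                  # full-step displacement
        a = (f_ext((n + 1) * dt) - f_int(u)) / m
        v = v + 0.5 * dt * a            # complete the velocity step
    return u, v
```

On a linear oscillator (f_int(u) = u, unit mass) the scheme tracks the exact cosine solution to second-order accuracy in dt.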

  1. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  2. Parallel/Series-Fed Microstrip Array Antenna

    NASA Technical Reports Server (NTRS)

    Huang, John

    1994-01-01

    Characteristics include low cross-polarization and high efficiency. Microstrip array antenna fabricated on two rectangular dielectric substrates. Produces fan-shaped beam polarized parallel to its short axis. Mounted conformally on outside surface of aircraft for use in synthetic-aperture radar. Other antennas of similar design mounted on roofs or sides of buildings, ships, or land vehicles for use in radar or communications.

  3. Preliminary study on the potential usefulness of array processor techniques for structural synthesis

    NASA Technical Reports Server (NTRS)

    Feeser, L. J.

    1980-01-01

    The effects of the use of array processor techniques within the structural analyzer program, SPAR, are simulated in order to evaluate the potential analysis speedups which may result. In particular the connection of a Floating Point System AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data estimates are made as to the relative speedups that can be executed in a more complete implementation on an array processor maxi-mini computer system.

  4. Parallel processors and nonlinear structural dynamics algorithms and software

    NASA Technical Reports Server (NTRS)

    Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

    1989-01-01

    The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.

  5. Operation of an adaptive processor using a photorefractive parallel integrator

    NASA Astrophysics Data System (ADS)

    Vachss, Frederick; Hong, John; Keefer, Chris; Malowicki, John

    1992-12-01

    A parallel optical technique for use in the adaptive processing of radar signals is described. We have demonstrated such a technique in an acousto-optic adaptive signal processing system designed to null the jamming of radar signals. This system features a time integrating correlator using a new type of photorefractive spatial light modulator exhibiting high levels of contrast, resolution, uniformity and sensitivity. Using this system we have demonstrated nulling of both narrow and wide band RF signals. The operation of the photorefractive integrating device, its performance and that of the overall system are discussed.
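The optical system above performs the adaptation physically in the photorefractive integrator. As a purely digital analogy of adaptive interference nulling, the sketch below runs a least-mean-squares (LMS) canceller: weights on a tapped reference channel adapt so the filter output cancels the jamming component of the primary channel. The tap count, step size, and signal model are our assumptions, not the paper's.

```python
import numpy as np

def lms_canceller(reference, primary, mu=0.05, taps=4):
    """LMS adaptive canceller: the residual e[n] is the primary channel
    after subtracting the adapted filter's estimate of the jamming."""
    w = np.zeros(taps)
    residual = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]    # most recent reference samples first
        e = primary[n] - w @ x             # cancel, keep the residual
        w += mu * e * x                    # stochastic-gradient weight update
        residual[n] = e
    return residual, w
```

When the jamming in the primary channel is a delayed, scaled copy of the reference, the residual power decays toward zero as the weights converge.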

  6. Parallel calculation of multi-electrode array correlation networks.

    PubMed

    Ribeiro, Pedro; Simonotto, Jennifer; Kaiser, Marcus; Silva, Fernando

    2009-11-15

When calculating correlation networks from multi-electrode array (MEA) data, the computations are extensive. Unfortunately, as the MEAs grow bigger, the time needed for the computation grows even more: calculating pair-wise correlations for current 60 channel systems can take hours on normal commodity computers, whereas for future 1000 channel systems it would take almost 280 times as long, given that the number of pairs increases with the square of the number of channels. Even taking into account the increase of speed in processors, it may soon become infeasible to compute correlations on a single computer. Parallel computing is a way to sustain reasonable calculation times in the future. We provide a general tool for rapid computation of correlation networks which was tested for: (a) a single computer cluster with 16 cores, (b) the Newcastle Condor System utilizing idle processors of university computers and (c) the inter-cluster, with 192 cores. Our reusable tool provides a simple interface for neuroscientists, automating data partition and job submission, and also allowing coding in any programming language. It is also sufficiently flexible to be used in other high-performance computing environments. PMID:19666054
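The parallelization pattern is simple because the n(n-1)/2 channel pairs are independent. A small sketch using a thread pool (rather than the paper's cluster/Condor setup; the function name and worker count are ours):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def correlation_network(channels, workers=4):
    """Pairwise Pearson correlations, with the independent channel
    pairs farmed out to a pool of workers."""
    n = len(channels)
    pairs = list(combinations(range(n), 2))

    def corr(pair):
        i, j = pair
        return i, j, float(np.corrcoef(channels[i], channels[j])[0, 1])

    mat = np.eye(n)                       # each channel correlates 1.0 with itself
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for i, j, c in ex.map(corr, pairs):
            mat[i, j] = mat[j, i] = c
    return mat
```

On a cluster the same partition of pairs would be distributed as jobs instead of thread tasks; only the dispatch layer changes.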

  7. Coupled cluster algorithms for networks of shared memory parallel processors

    NASA Astrophysics Data System (ADS)

    Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.

    2007-05-01

As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too increases the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] (General Atomic and Molecular Electronic Structure System) program suite and the Distributed Data Interface [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190] (DDI); however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.

  8. High-speed Systolic Array Processor (HISSAP) system development synopsis: Lesson learned. Final report, Oct 83-Oct 90

    SciTech Connect

    Loughlin, J.P.

    1991-05-01

    This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic and software mapping tools is described. Based on the experience gained in the mapping of matrix-based algorithms on the testbed hardware, specific recommendations are presented in the form of lessons learned, which are intended to offer guidance in the development of future Navy signal processing systems.

  9. Task rescheduling model for resource disruption problem in unrelated parallel processor system

    NASA Astrophysics Data System (ADS)

    Nordin, Syarifah Zyurina; Caccetta, Lou

    2014-07-01

    In this paper, we concentrate on the scheduling problem that arises when an interruption occurs in a parallel processor system. The situation arises when the availability of the unrelated parallel processors in certain time slots decreases; this is defined as a resource disruption. Our objective is to consider a recovery scheduling option for this issue, to avoid infeasibility of the original scheduling plan. Our recovery approach is task rescheduling: the tasks in the initial schedule plan are reassigned to reflect the new restrictions. A recovery mixed integer linear programming model is proposed to solve the disruption problem. We also conduct a computational experiment using the CPLEX 12.1 solver in AIMMS 3.10 software to analyze the performance of the model.
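The paper's recovery model is a mixed integer linear program solved with CPLEX; as a minimal stand-in, the brute-force sketch below (all names hypothetical) reassigns the tasks of an unrelated-parallel-machine schedule to the processors that survive a disruption, minimizing makespan:

```python
from itertools import product

def reschedule(times, available):
    """Reassign tasks after a disruption, minimizing makespan.

    times[t][p] is task t's processing time on processor p (unrelated
    machines: rows need not be related). available lists the processors
    still usable. Brute force is fine at toy sizes; the paper's model
    is a MILP solved with CPLEX instead.
    """
    best = (float("inf"), None)
    for assign in product(available, repeat=len(times)):
        loads = {p: 0 for p in available}
        for t, p in enumerate(assign):
            loads[p] += times[t][p]
        makespan = max(loads.values())
        if makespan < best[0]:
            best = (makespan, assign)
    return best
```

For example, with three tasks and processor 2 knocked out, `reschedule([[2, 1, 4], [3, 2, 1], [2, 2, 2]], (0, 1))` finds a new assignment over the two surviving processors with makespan 3.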

  10. Array distribution in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  11. Aligning parallel arrays to reduce communication

    NASA Technical Reports Server (NTRS)

    Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

    1994-01-01

    Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

  12. Construction of a parallel processor for simulating manipulators and other mechanical systems

    NASA Technical Reports Server (NTRS)

    Hannauer, George

    1991-01-01

    This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

  13. Parallel microfluidic arrays for SPRi detection

    NASA Astrophysics Data System (ADS)

    Ouellet, Eric; Lausted, Christopher; Lin, Tao; Yang, Cheng-Wei; Hood, Leroy; Lagally, Eric T.

    2010-04-01

    Surface Plasmon Resonance imaging (SPRi) is a label-free technique for the quantitation of binding affinities and concentrations for a wide variety of target molecules. Although SPRi is capable of determining binding constants for multiple ligands in parallel, current commercial instruments are limited to a single analyte stream and a limited number of ligand spots. Measurement of target concentration also requires the serial introduction of different target concentrations; such repeated experiments are conducted manually and are therefore time-intensive. Likewise, the equilibrium determination of concentration for known binding affinity requires long times due to diffusion-limited kinetics to a surface-immobilized ligand. We have developed an integrated microfluidic array using soft lithography techniques for SPRi-based detection and determination of binding affinities for DNA aptamers against human alpha-thrombin. The device consists of 264 element-addressable chambers of 700 pL each isolated by microvalves. The device also contains a dilution network for simultaneous interrogation of up to six different target concentrations, further speeding detection times. The element-addressable design of the array allows interrogation of multiple ligands against multiple targets, and analytes from individual chambers may be collected for downstream analysis.

  14. Parallel microfluidic arrays for SPRi detection

    NASA Astrophysics Data System (ADS)

    Ouellet, Eric; Lausted, Christopher; Hood, Leroy; Lagally, Eric T.

    2008-08-01

    Surface Plasmon Resonance imaging (SPRi) is a label-free technique for the quantitation of binding affinities and concentrations for a wide variety of target molecules. Although SPRi is capable of determining binding constants for multiple ligands in parallel, current commercial instruments are limited to a single analyte stream and a limited number of ligand spots. Measurement of target concentration also requires the serial introduction of different target concentrations; such repeated experiments are conducted manually and are therefore time-intensive. Likewise, the equilibrium determination of concentration for known binding affinity requires long times due to diffusion-limited kinetics to a surface-immobilized ligand. We have developed an integrated microfluidic array using soft lithography techniques for SPRi-based detection and determination of binding affinities for DNA aptamers against human alpha-thrombin. The device consists of 264 element-addressable chambers isolated by microvalves. The resulting 700 pL volumes surrounding each ligand spot promise to decrease measurement time through reaction rate-limited kinetics. The device also contains a dilution network for simultaneous interrogation of up to six different target concentrations, further speeding detection times. Finally, the element-addressable design of the array allows interrogation of multiple ligands against multiple targets.

  15. Parallel fabrication of plasmonic nanocone sensing arrays.

    PubMed

    Horrer, Andreas; Schäfer, Christian; Broch, Katharina; Gollmer, Dominik A; Rogalski, Jan; Fulmes, Julia; Zhang, Dai; Meixner, Alfred J; Schreiber, Frank; Kern, Dieter P; Fleischer, Monika

    2013-12-01

    A fully parallel approach for the fabrication of arrays of metallic nanocones and triangular nanopyramids is presented. Different processes utilizing nanosphere lithography for the creation of etch masks are developed. Monolayers of spheres are reduced in size and directly used as masks, or mono- and double layers are employed as templates for the deposition of aluminum oxide masks. The masks are transferred into an underlying gold or silver layer by argon ion milling, which leads to nanocones or nanopyramids with very sharp tips. Near the tips the enhancement of an external electromagnetic field is particularly strong. This fact is confirmed by numerical simulations and by luminescence imaging in a confocal microscope. Such localized strong fields can be utilized, among other applications, for high-resolution, high-sensitivity spectroscopy and sensing of molecules near the tip. Arrays of such plasmonic nanostructures thus constitute controllable platforms for surface-enhanced Raman spectroscopy. A thin film of pentacene molecules is evaporated onto both nanocone and nanopyramid substrates, and the observed Raman enhancement is evaluated.

  16. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    SciTech Connect

    Cancelo, G.; Gottschalk, Erik Edward; Pavlicek, V.; Wang, M.; Wu, J.

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  17. An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications

    SciTech Connect

    Vydyanathan, Naga; Krishnamoorthy, Sriram; Sabin, Gerald M.; Catalyurek, Umit V.; Kurc, Tahsin; Sadayappan, Ponnuswamy; Saltz, Joel H.

    2009-08-01

    Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application-tasks with dependences. These applications exhibit both task- and data-parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task- and data-parallelism required to minimize the parallel completion time (makespan) of these applications. In other words, our algorithm determines the set of tasks that should be run concurrently and the number of processors to be allocated to each task. The processor allocation and scheduling decisions are made in an integrated manner and are based on several factors such as the structure of the task graph, the runtime estimates and scalability characteristics of the tasks, and the inter-task data communication volumes. A locality-conscious scheduling strategy is used to improve inter-task data reuse. Evaluation through simulations and actual executions of task graphs derived from real applications as well as synthetic graphs shows that our algorithm consistently generates schedules with lower makespan as compared to CPR and CPA, two previously proposed scheduling algorithms. Our algorithm also produces schedules that have lower makespan than pure task- and data-parallel schedules. For task graphs with known optimal schedules or lower bounds on the makespan, our algorithm generates schedules that are closer to the optima than other scheduling approaches.

  18. The fast multipole method on parallel clusters, multicore processors, and graphics processing units

    NASA Astrophysics Data System (ADS)

    Darve, Eric; Cecka, Cris; Takahashi, Toru

    2011-02-01

    In this article, we discuss how the fast multipole method (FMM) can be implemented on modern parallel computers, ranging from computer clusters to multicore processors and graphics cards (GPU). The FMM is a somewhat difficult application for parallel computing because of its tree structure and the fact that it requires many complex operations which are not regularly structured. Computational linear algebra with dense matrices for example allows many optimizations that leverage the regular computation pattern. FMM can be similarly optimized but we will see that the complexity of the optimization steps is greater. The discussion will start with a general presentation of FMMs. We briefly discuss parallel methods for the FMM, such as building the FMM tree in parallel, and reducing communication during the FMM procedure. Finally, we will focus on porting and optimizing the FMM on GPUs.

  19. Implementation of context independent code on a new array processor: The Super-65

    NASA Technical Reports Server (NTRS)

    Colbert, R. O.; Bowhill, S. A.

    1981-01-01

    The feasibility of rewriting standard uniprocessor programs into code which contains no context-dependent branches is explored. Context independent code (CIC) would contain no branches that might require different processing elements to branch different ways. In order to investigate the possibilities and restrictions of CIC, several programs were recoded into CIC and a four-element array processor was built. This processor (the Super-65) consisted of three 6502 microprocessors and the Apple II microcomputer. The results obtained were somewhat dependent upon the specific architecture of the Super-65 but within bounds, the throughput of the array processor was found to increase linearly with the number of processing elements (PEs). The slope of throughput versus PEs is highly dependent on the program and varied from 0.33 to 1.00 for the sample programs.

  20. Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition

    DOEpatents

    Tomkins, James L.; Camp, William J.

    2007-07-17

    A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

  1. Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Gibson, Garth Alan

    1990-01-01

    During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays provide the cost, volume, and capacity of current disk subsystems and, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
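The parity code the thesis analyzes can be illustrated in a few lines (a toy sketch, not RAID-level code; block contents and helper names are invented): XOR-ing all data blocks yields the parity block, and XOR-ing the parity with the surviving blocks regenerates any single failed block, provided the failure is self-identifying.

```python
def parity_block(blocks):
    # XOR equal-length data blocks byte-wise to form the parity block
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def reconstruct(surviving, parity):
    # XOR is its own inverse, so parity XOR survivors = the failed block
    return parity_block(list(surviving) + [parity])
```

With blocks `b"disk0"`, `b"disk1"`, `b"disk2"` and their parity, losing any one block leaves enough information to rebuild it exactly.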

  2. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    PubMed Central

    Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard

    2004-01-01

    Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be substantially reduced. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope. PMID:15357879

  3. DC simulator of large-scale nonlinear systems for parallel processors

    NASA Astrophysics Data System (ADS)

    Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel

    In this paper it is shown how the idea of BBD decomposition of large-scale nonlinear systems can be implemented in a parallel DC circuit simulation algorithm. Usually, BBD nonlinear circuit decomposition is used together with the multi-level Newton-Raphson iterative process. We propose a simulation consisting of circuit decomposition and process parallelization on a single level only. This block-parallel approach may yield considerable savings in simulation time, though it is strongly dependent on the system topology and, of course, on the processor type. The paper presents the architecture of the decomposition-based algorithm, explains details of its implementation, including two steps of the one-level bypassing techniques, and discusses the construction of dedicated benchmarks for this simulation software.

  4. Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies

    NASA Astrophysics Data System (ADS)

    Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio

    2011-11-01

    Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of the RX in multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability in different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristic (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by the NASA's Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.
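For reference, the RX statistic is the Mahalanobis distance of each pixel from the scene mean under the sample covariance. The pure-Python sketch below implements the global-covariance variant for two-band pixels (a local variant would recompute the statistics per spatial block); the function and variable names are illustrative, not from the paper:

```python
def rx_scores(pixels):
    """Global RX anomaly detector for 2-band pixels: Mahalanobis
    distance to the scene mean under the sample covariance."""
    n = len(pixels)
    m0 = sum(p[0] for p in pixels) / n
    m1 = sum(p[1] for p in pixels) / n
    # Sample covariance (maximum-likelihood normalization, 1/n)
    c00 = sum((p[0] - m0) ** 2 for p in pixels) / n
    c11 = sum((p[1] - m1) ** 2 for p in pixels) / n
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in pixels) / n
    det = c00 * c11 - c01 * c01
    # Inverse of the 2x2 covariance, written out explicitly
    i00, i01, i11 = c11 / det, -c01 / det, c00 / det

    def score(p):
        d0, d1 = p[0] - m0, p[1] - m1
        return d0 * (i00 * d0 + i01 * d1) + d1 * (i01 * d0 + i11 * d1)

    # The per-pixel scores are independent given the global statistics,
    # which is what makes the spatial-domain partitioning parallelizable.
    return [score(p) for p in pixels]
```

A pixel far from the background cloud receives the largest score; thresholding the scores yields the anomaly map.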

  5. Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors

    SciTech Connect

    Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K

    2010-01-01

    An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

  6. Series-parallel method of direct solar array regulation

    NASA Technical Reports Server (NTRS)

    Gooder, S. T.

    1976-01-01

    A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within + or - 0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.
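The control law amounts to choosing how many segments to short so that the remaining stack best matches the set-point. A toy linearized model of the series side (fixed voltage per un-shorted segment, ignoring the real nonlinear I-V curve and the parallel dimension; all names hypothetical):

```python
def regulate(setpoint, segments, v_per_segment):
    """Pick the number of series segments to short so that the voltage
    of the remaining stack is closest to the set-point."""
    best_k = min(range(segments + 1),
                 key=lambda k: abs((segments - k) * v_per_segment - setpoint))
    return best_k, (segments - best_k) * v_per_segment
```

With 40 one-volt segments, any set-point between 0 and 40 V is met to within half a segment voltage, consistent in spirit with the ±0.5 V regulation reported; the real scheme additionally switches parallel segments to reduce switch power dissipation.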

  7. Inventory estimation on the massively parallel processor. [from satellite based images

    NASA Technical Reports Server (NTRS)

    Argentiero, P. D.; Strong, J. P.; Koch, D. W.

    1980-01-01

    This paper describes algorithms for efficiently computing inventory estimates from satellite based images. The algorithms incorporate a one dimensional feature extraction which optimizes the pairwise sum of Fisher distances. Biases are eliminated with a premultiplication by the inverse of the analytically derived error matrix. The technique is demonstrated with a numerical example using statistics obtained from an actual Landsat scene. Attention was given to implementation on the Massively Parallel Processor (MPP). A timing analysis demonstrates that the inventory estimation can be performed an order of magnitude faster on the MPP than on a conventional serial machine.

  8. Estimating water flow through a hillslope using the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Devaney, Judy E.; Camillo, P. J.; Gurney, R. J.

    1988-01-01

    A new two-dimensional model of water flow in a hillslope has been implemented on the Massively Parallel Processor at the Goddard Space Flight Center. Flow in the soil both in the saturated and unsaturated zones, evaporation and overland flow are all modelled, and the rainfall rates are allowed to vary spatially. Previous models of this type had always been very limited computationally. This model takes less than a minute to model all the components of the hillslope water flow for a day. The model can now be used in sensitivity studies to specify which measurements should be taken and how accurate they should be to describe such flows for environmental studies.

  9. Block iterative restoration of astronomical images with the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Heap, Sara R.; Lindler, Don J.

    1987-01-01

    A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
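The block-iterative idea in miniature: partition the unknowns, solve each small diagonal block exactly, and iterate with the off-block terms held at their latest values. Below is a pure-Python sketch on a 4x4 system split into two 2x2 blocks; the paper's 121 x 121 blocks work the same way at larger scale, and all names here are illustrative:

```python
def solve2(a, b):
    # Direct solve of a 2x2 system via Cramer's rule
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - b[1] * a[0][1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def block_jacobi(A, b, blocks, iters=100):
    """Block-iterative solve of A x = b: each sweep solves every 2x2
    diagonal block exactly, holding the other blocks' unknowns fixed."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        new = x[:]
        for blk in blocks:  # blk: list of two row indices
            # Move the off-block terms to the right-hand side
            rhs = [b[i] - sum(A[i][j] * x[j] for j in range(n) if j not in blk)
                   for i in blk]
            sub = [[A[i][j] for j in blk] for i in blk]
            sol = solve2(sub, rhs)
            for k, i in enumerate(blk):
                new[i] = sol[k]
        x = new
    return x
```

Each block solve is independent within a sweep, so the blocks can be handled by separate processing elements, which is what made the approach attractive on the MPP.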

  10. Ferroelectric/Optoelectronic Memory/Processor

    NASA Technical Reports Server (NTRS)

    Thakoor, Sarita; Thakoor, Anilkumar P.

    1992-01-01

    Proposed hybrid optoelectronic nonvolatile analog memory and data processor comprises planar array of microscopic photosensitive ferroelectric capacitors performing massively parallel analog computations. Processors overcome electronic crosstalk and limitations on number of input/output contacts inherent in electronic implementations of large interconnection arrays. Used in general optical computing, recognition of patterns, and artificial neural networks.

  11. Transformation from C-program to circuitry for a dynamically reconfigurable cell array processor

    NASA Astrophysics Data System (ADS)

    Morishita, Takayuki; Komoku, Kiyotaka; Hatano, Fumihiro; Teramoto, Iwao

    2001-07-01

    We have been developing a parallel processor whose hardware can be reconfigured according to the software it runs. Dynamic reconfiguration means changing the type and number of processing elements, and the connections between processing elements, at run time. Our proposed processor creates a very long pipeline, which can execute for-loop calculations at very high speed. In this paper, we develop an algorithm that automatically transforms a C-language program into a circuit diagram. In particular, we consider the processing of if-statements and for-statements and realize high-performance execution of them through pipelining. The automatic transformation program is written in C. Finally, we examine the performance of this processor using an MPEG decoding program.

  12. Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor

    NASA Astrophysics Data System (ADS)

    Gan, Ge; Wang, Xu; Manzano, Joseph; Gao, Guang R.

    Programming a multicore processor is difficult. It is even more difficult if the processor has software-managed memory hierarchy, e.g. the IBM Cyclops-64 (C64). A widely accepted parallel programming solution for multicore processor is OpenMP. Currently, all OpenMP directives are only used to decompose computation code (such as loop iterations, tasks, code sections, etc.). None of them can be used to control data movement, which is crucial for the C64 performance. In this paper, we propose a technique called tile percolation. This method provides the programmer with a set of OpenMP pragma directives. The programmer can use these directives to annotate their program to specify where and how to perform data movement. The compiler will then generate the required code accordingly. Our method is a semi-automatic code generation approach intended to simplify a programmer’s work. The paper provides (a) an exploration of the possibility of developing pragma directives for semi-automatic data movement code generation in OpenMP; (b) an introduction of techniques used to implement tile percolation including the programming API, the code generation in compiler, and the required runtime support routines; (c) and an evaluation of tile percolation with a set of benchmarks. Our experimental results show that tile percolation can make the OpenMP programs run on the C64 chip more efficiently.

  13. A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors

    PubMed Central

    Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

    2009-01-01

    A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array, with row-parallel 6-bit Algorithmic ADCs, row-parallel gray-scale image processors, pixel-parallel SIMD Processing Element (PE) array, and instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixels resolution and 6-bit gray-scale image is fabricated in a 0.18 μm standard CMOS process. The chip area is 1.5 mm × 3.5 mm. Each pixel size is 9.5 μm × 9.5 μm and each processing element size is 23 μm × 29 μm. The experiment results demonstrate that the chip can perform low-level and mid-level image processing and it can be applied in real-time vision applications, such as high speed target tracking. PMID:22454565

  14. Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2011-04-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

  15. Animated computer graphics models of space and earth sciences data generated via the massively parallel processor

    NASA Technical Reports Server (NTRS)

    Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

    1987-01-01

    A capability was developed for rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets by implementing computer graphics modeling techniques on the Massively Parallel Processor (MPP), employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

  16. Application of an array processor to the analysis of magnetic data for the Doublet III tokamak

    SciTech Connect

    Wang, T.S.; Saito, M.T.

    1980-08-01

    Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed.
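The kernel repeated in such an analysis is a Poisson-type solve. A much-simplified Jacobi sketch conveys the structure of that inner loop; the tokamak code's free-boundary MHD equilibrium solver is far more involved, and `jacobi_poisson` is a hypothetical name, not the paper's routine.

```python
def jacobi_poisson(f, n, iters=200):
    # Jacobi iteration for -laplace(u) = f on the unit square with u = 0 on
    # the boundary, discretized on an n-by-n interior grid of spacing h.
    # Purely illustrative of the repeated Poisson solves described above.
    h = 1.0 / (n + 1)
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for _ in range(iters):
        new = [row[:] for row in u]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                    + u[i][j - 1] + u[i][j + 1]
                                    + h * h * f(i * h, j * h))
        u = new
    return u
```

The all-interior-points update is exactly the kind of data-parallel sweep an array processor pipelines well.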

  17. Parallel scheduling of recursively defined arrays

    NASA Technical Reports Server (NTRS)

    Myers, T. J.; Gokhale, M. B.

    1986-01-01

    A new method of automatic generation of concurrent programs which constructs arrays defined by sets of recursive equations is described. It is assumed that the time of computation of an array element is a linear combination of its indices, and integer programming is used to seek a succession of hyperplanes along which array elements can be computed concurrently. The method can be used to schedule equations involving variable length dependency vectors and mutually recursive arrays. Portions of the work reported here have been implemented in the PS automatic program generation system.
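The hyperplane idea can be illustrated with a toy brute-force search in place of the paper's integer programming: find an integer vector pi such that every dependency vector d satisfies pi . d >= 1, so all elements on a hyperplane pi . x = t can be computed concurrently at step t. The function name and the `bound` parameter are illustrative.

```python
from itertools import product

def find_schedule_vector(deps, bound=3):
    # Exhaustively search small integer scheduling vectors pi with
    # pi . d >= 1 for every dependency vector d, preferring small |pi|
    # (a crude stand-in for the integer-programming formulation).
    dim = len(deps[0])
    best = None
    for pi in product(range(-bound, bound + 1), repeat=dim):
        if all(sum(p * di for p, di in zip(pi, d)) >= 1 for d in deps):
            cost = sum(abs(p) for p in pi)  # prefer "flat" schedules
            if best is None or cost < best[0]:
                best = (cost, pi)
    return best[1] if best else None
```

For the classic uniform dependencies {(1,0), (0,1)} this yields the diagonal-wavefront schedule pi = (1, 1).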

  18. An introduction to coil array design for parallel MRI.

    PubMed

    Ohliger, Michael A; Sodickson, Daniel K

    2006-05-01

    The basic principles of radiofrequency coil array design for parallel MRI are described from both theoretical and practical perspectives. Because parallel MRI techniques rely on coil array sensitivities to provide spatial information about the sample, a careful choice of array design is essential. The concepts of coil array spatial encoding are first discussed from four qualitative perspectives. These qualitative descriptions include using coil arrays to emulate spatial harmonics, choosing coils with selective sensitivities to aliased pixels, using coil sensitivities with broad k-space reception profiles, and relying on detector coils to provide a set of generalized projections of the sample. This qualitative discussion is followed by a quantitative analysis of coil arrays, which is discussed in terms of the baseline SNR of the received images as well as the noise amplifications (g-factor) in the reconstructed data. The complications encountered during the experimental evaluation of coil array SNR are discussed, and solutions are proposed. A series of specific array designs are reviewed, with an emphasis on the general design considerations that motivate each approach. Finally, a set of special topics is discussed, which reflect issues that have become important, especially as arrays are being designed for more high-performance applications of parallel MRI. These topics include concerns about the depth penetration of arrays composed of small elements, the use of adaptive arrays for systems with limited receiver channels, the management of inductive coupling between array elements, and special considerations required at high field strengths. The fundamental limits of spatial encoding using coil arrays are discussed, with a primary emphasis on how the determination of these limits impacts the design of optimized arrays. This review is intended to provide insight into how arrays are currently used for parallel MRI and to place into context the new innovations that are
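The g-factor mentioned above has a standard closed form for SENSE-type reconstruction. A toy two-pixel, real-valued sketch, assuming identity noise covariance (a textbook simplification, not a formula quoted from this review), is:

```python
import math

def g_factor_2pix(S):
    # Textbook SENSE g-factor for two aliased pixels, real-valued coil
    # sensitivities, identity noise covariance (simplifying assumptions):
    #   g_j = sqrt([(S^H S)^-1]_jj * [S^H S]_jj)
    # S is a list of per-coil sensitivity pairs (s_pixel1, s_pixel2).
    a = sum(s1 * s1 for s1, _ in S)
    b = sum(s1 * s2 for s1, s2 in S)
    c = sum(s2 * s2 for _, s2 in S)
    det = a * c - b * b                 # determinant of the 2x2 matrix S^H S
    return (math.sqrt((c / det) * a), math.sqrt((a / det) * c))
```

Orthogonal sensitivities give the ideal g = 1; nearly parallel sensitivities drive det toward zero and the g-factor (noise amplification) up, which is why array geometry matters.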

  19. Parallel collective resonances in arrays of gold nanorods.

    PubMed

    Vitrey, Alan; Aigouy, Lionel; Prieto, Patricia; García-Martín, José Miguel; González, María U

    2014-01-01

    In this work we discuss the excitation of parallel collective resonances in arrays of gold nanoparticles. Parallel collective resonances result from the coupling of the nanoparticles' localized surface plasmons with diffraction orders traveling in the direction parallel to the polarization vector. While they provide field enhancement and delocalization like standard collective resonances, our results suggest that parallel resonances could exhibit greater tolerance to index asymmetry in the environment surrounding the arrays. The near- and far-field properties of these resonances are analyzed, both experimentally and numerically. PMID:24645987

  1. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    NASA Astrophysics Data System (ADS)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performance and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application exhibits near-linear speed-up when using only the CPU cores, and executes more than 20 times faster when the GPU is used as well.

  2. Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Dimpsey, Robert Tod

    1992-01-01

    In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time-varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine, ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion (response) time of a given application. However, few evaluation efforts have addressed this issue.

  3. Optoelectronic implementation of a 256-channel sonar adaptive-array processor.

    PubMed

    Silveira, Paulo E X; Pati, Gour S; Wagner, Kelvin H

    2004-12-10

    We present an optoelectronic implementation of an adaptive-array processor that is capable of performing beam forming and jammer nulling in signals of wide fractional bandwidth that are detected by an array of arbitrary topology. The optical system makes use of a two-dimensional scrolling spatial light modulator to represent an array of input signals in 256 tapped delay lines, two acousto-optic modulators for modulating the feedback error signal, and a photorefractive crystal for representing the adaptive weights as holographic gratings. Gradient-descent learning is used to dynamically adapt the holographic weights to optimally form multiple beams and to null out multiple interference sources, either in the near field or in the far field. Space-integration followed by differential heterodyne detection is used for generating the system's output. The processor is analyzed to show the effects of exponential weight decay on the optimum solution and on the convergence conditions. Several experimental results are presented that validate the system's capacity for broadband beam forming and jammer nulling for linear and circular arrays.
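The gradient-descent learning on the adaptive weights is, in discrete time, the familiar least-mean-squares (LMS) update over tapped delay lines. A minimal sketch (illustrative of the learning rule only, not the optoelectronic implementation; names and parameters are assumed):

```python
def lms_adapt(tap_vectors, desired, mu=0.05, n_taps=4):
    # LMS weight adaptation: steepest-descent on the squared error, the
    # discrete-time analogue of the gradient-descent learning performed on
    # the holographic weights in the photorefractive crystal.
    w = [0.0] * n_taps
    for x, d in zip(tap_vectors, desired):
        y = sum(wi * xi for wi, xi in zip(w, x))   # beamformer output
        e = d - y                                  # feedback error signal
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
    return w
```

When the desired signal lies in the span of the taps, the weights converge to the optimum filter; a decay term on `w` would model the exponential weight decay analyzed in the paper.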

  4. A processor-time-minimal systolic array for cubical mesh algorithms

    SciTech Connect

    Cappello, P. (Dept. of Computer Science)

    1992-01-01

    Using a directed acyclic graph (dag) model of algorithms, the paper focuses on time-minimal multiprocessor schedules that use as few processors as possible. Such a processor-time-minimal scheduling of an algorithm's dag is first illustrated using a triangular 2-D directed mesh (representing, for example, an algorithm for solving a triangular system of linear equations). Then, algorithms represented by an n × n × n directed mesh are investigated. This cubical directed mesh is fundamental; it represents the standard algorithm for computing matrix product as well as many other algorithms. Completion of the cubical mesh requires 3n - 2 steps. It is shown that the number of processing elements needed to achieve this time bound is at least ⌈3n²/4⌉. A systolic array for the cubical directed mesh is then presented. It completes the mesh using the minimum number of steps and exactly ⌈3n²/4⌉ processing elements: it is processor-time-minimal. The systolic array's topology is that of a hexagonally shaped, cylindrically connected 2-D directed mesh.
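The 3n - 2 completion time follows from an as-soon-as-possible schedule in which node (i, j, k) of the mesh fires at step i + j + k + 1; a trivial check of that bound:

```python
def cubical_mesh_steps(n):
    # Under an ASAP schedule with unbounded processors, node (i, j, k) of
    # the n x n x n directed mesh fires at step i + j + k + 1 (0-based
    # indices), so the last node fires at step 3(n - 1) + 1 = 3n - 2.
    return max(i + j + k
               for i in range(n) for j in range(n) for k in range(n)) + 1
```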

  5. High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

    DOEpatents

    Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.

    2002-01-01

    As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low-latency, high-bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

  7. Parallel Spectral Acquisition with an Ion Cyclotron Resonance Cell Array.

    PubMed

    Park, Sung-Gun; Anderson, Gordon A; Navare, Arti T; Bruce, James E

    2016-01-19

    Mass measurement accuracy is a critical analytical figure-of-merit in most areas of mass spectrometry application. However, the time required for acquisition of high-resolution, high mass accuracy data limits many applications and is an aspect under continual pressure for development. Current efforts target implementation of higher electrostatic and magnetic fields because ion oscillatory frequencies increase linearly with field strength. As such, the time required for spectral acquisition of a given resolving power and mass accuracy decreases linearly with increasing fields. Mass spectrometer developments to include multiple high-resolution detectors that can be operated in parallel could further decrease the acquisition time by a factor of n, the number of detectors. Efforts described here resulted in development of an instrument with a set of Fourier transform ion cyclotron resonance (ICR) cells as detectors that constitute the first MS array capable of parallel high-resolution spectral acquisition. ICR cell array systems consisting of three or five cells were constructed with printed circuit boards and installed within a single superconducting magnet and vacuum system. Independent ion populations were injected and trapped within each cell in the array. Upon filling the array, all ions in all cells were simultaneously excited and ICR signals from each cell were independently amplified and recorded in parallel. Presented here are the initial results of successful parallel spectral acquisition, parallel mass spectrometry (MS) and MS/MS measurements, and parallel high-resolution acquisition with the MS array system. PMID:26669509

  8. Parallel Access of Out-Of-Core Dense Extendible Arrays

    SciTech Connect

    Otoo, Ekow J; Rotem, Doron

    2007-07-26

    Datasets used in scientific and engineering applications are often modeled as dense multi-dimensional arrays. For very large datasets, the corresponding array models are typically stored out-of-core as array files. The array elements are mapped onto linear consecutive locations that correspond to the linear ordering of the multi-dimensional indices. Two conventional mappings used are the row-major order and the column-major order of multi-dimensional arrays. Such conventional mappings of dense array files severely limit the performance of applications and the extendibility of the dataset. Firstly, an array file that is organized in, say, row-major order causes applications that subsequently access the data in column-major order to have abysmal performance. Secondly, any subsequent expansion of the array file is limited to only one dimension. Expansions of such out-of-core conventional arrays along arbitrary dimensions require storage reorganization that can be very expensive. We present a solution for storing out-of-core dense extendible arrays that resolves these two limitations. The method uses a mapping function F*(), together with information maintained in axial vectors, to compute the linear address of an extendible array element when passed its k-dimensional index. We also give the inverse function, F*⁻¹(), for deriving the k-dimensional index when given the linear address. We show how the mapping function, in combination with MPI-IO and a parallel file system, allows for the growth of the extendible array without reorganization and with no significant performance degradation of applications accessing elements in any desired order. We give methods for reading and writing sub-arrays into and out of parallel applications that run on a cluster of workstations. The axial vectors are replicated and maintained in each node that accesses sub-array elements.
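The axial-vector idea can be sketched for a 2-D in-memory case: each extension appends a contiguous segment of storage, and addressing an element picks whichever row or column segment was allocated later, so no earlier data is ever reorganized. This mirrors the spirit of F*(), not its exact definition; the class and method names are illustrative.

```python
class Extendible2D:
    # Simplified axial-vector addressing for a 2-D extendible array.
    # Each axial-vector entry records (first_index, base_address,
    # extent_of_other_dimension_at_allocation_time).
    def __init__(self, rows, cols):
        self.row_ax = [(0, 0, cols)]   # initial block stored row-major
        self.col_ax = []
        self.rows, self.cols = rows, cols
        self.size = rows * cols

    def extend_rows(self, k):          # append k new rows, no reorganization
        self.row_ax.append((self.rows, self.size, self.cols))
        self.size += k * self.cols
        self.rows += k

    def extend_cols(self, k):          # append k new columns likewise
        self.col_ax.append((self.cols, self.size, self.rows))
        self.size += k * self.rows
        self.cols += k

    def address(self, i, j):
        # Row segment covering i; if it was allocated after column j
        # existed, the element lives there, else in j's column segment.
        r0, rbase, c_then = max(e for e in self.row_ax if e[0] <= i)
        if j < c_then:
            return rbase + (i - r0) * c_then + j
        c0, cbase, r_then = max(e for e in self.col_ax if e[0] <= j)
        return cbase + (j - c0) * r_then + i
```

Every extension appends at the end of the file-like address space, so both row and column growth avoid the expensive storage reorganization described above.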

  9. Feasibility of using the Massively Parallel Processor for large eddy simulations and other Computational Fluid Dynamics applications

    NASA Technical Reports Server (NTRS)

    Bruno, John

    1984-01-01

    The results of an investigation into the feasibility of using the MPP for direct and large eddy simulations of the Navier-Stokes equations is presented. A major part of this study was devoted to the implementation of two of the standard numerical algorithms for CFD. These implementations were not run on the Massively Parallel Processor (MPP) since the machine delivered to NASA Goddard does not have sufficient capacity. Instead, a detailed implementation plan was designed and from these were derived estimates of the time and space requirements of the algorithms on a suitably configured MPP. In addition, other issues related to the practical implementation of these algorithms on an MPP-like architecture were considered; namely, adaptive grid generation, zonal boundary conditions, the table lookup problem, and the software interface. Performance estimates show that the architectural components of the MPP, the Staging Memory and the Array Unit, appear to be well suited to the numerical algorithms of CFD. This combined with the prospect of building a faster and larger MMP-like machine holds the promise of achieving sustained gigaflop rates that are required for the numerical simulations in CFD.

  10. Some current uses of array processors for preprocessing of remote sensing data

    NASA Technical Reports Server (NTRS)

    Fischel, D.

    1984-01-01

    The preparation of remotely sensed data sets into a form useful to the analyst is a significant computational task, involving the processing of spacecraft data (e.g., orbit, attitude, temperatures, etc.), decommutation of the video telemetry stream, radiometric correction and geometric correction. Many of these processes are extremely well suited for implementation on attached array processors. Currently, at Goddard Space Flight Center a number of computer systems provide such capability for earth observations or are under development as test beds for future ground segment support. Six such systems will be discussed.

  11. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    DOE PAGES

    O'keefe, Matthew; Parr, Terence; Edgar, B. Kevin; Anderson, Steve; Woodward, Paul; Dietz, Hank

    1995-01-01

    Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  12. Parallel arrays of Josephson junctions for submillimeter local oscillators

    NASA Technical Reports Server (NTRS)

    Pance, Aleksandar; Wengler, Michael J.

    1992-01-01

    In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

  13. An Analog Processor for Image Compression

    NASA Technical Reports Server (NTRS)

    Tawel, R.

    1992-01-01

    This paper describes a novel analog Vector Array Processor (VAP) that was designed for use in real-time and ultra-low-power image compression applications. This custom CMOS processor is based architecturally on the Vector Quantization (VQ) algorithm for image coding, and the hardware implementation fully exploits the inherent parallelism built into the VQ algorithm.
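The VQ encoding step that such hardware parallelizes is a nearest-codeword search over image blocks; serially it reads as the following sketch (function name is illustrative):

```python
def vq_encode(blocks, codebook):
    # Vector quantization encoding: map each input block to the index of its
    # nearest codebook vector under squared error. The analog VAP evaluates
    # all codeword distances in parallel; this loop is the serial equivalent.
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: d2(b, codebook[k]))
            for b in blocks]
```

Compression comes from transmitting only the codeword indices rather than the blocks themselves.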

  14. NOSC (Naval Ocean Systems Center) advanced systolic array processor (ASAP). Professional paper for period ending August 1987

    SciTech Connect

    Loughlin, J.P.

    1987-12-01

    Design of a high-speed (250 million 32-bit floating-point operations per second) two-dimensional systolic array composed of 16-bit/slice microsequencer structured processors is presented. System-design features such as broadcast data flow, tag bit movement, and integrated diagnostic test registers are described. The software development tools needed to map complex matrix-based signal-processing algorithms onto the systolic-processor system are described.

  15. Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor

    NASA Technical Reports Server (NTRS)

    Field, E. I.

    1975-01-01

    The ILLIAC IV, a fourth-generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics of the ILLIAC and of the ARPANET, a telecommunications network which spans the continent and makes the ILLIAC accessible to nearly all major industrial centers in the United States, are summarized. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third-generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task, with corresponding manpower estimates and schedules.

  16. Constructing higher order DNA origami arrays using DNA junctions of anti-parallel/parallel double crossovers

    NASA Astrophysics Data System (ADS)

    Ma, Zhipeng; Park, Seongsu; Yamashita, Naoki; Kawai, Kentaro; Hirai, Yoshikazu; Tsuchiya, Toshiyuki; Tabata, Osamu

    2016-06-01

    DNA origami provides a versatile method for the construction of nanostructures with defined shape, size and other properties; such nanostructures may enable a hierarchical assembly of large scale architecture for the placement of other nanomaterials with atomic precision. However, the effective use of these higher order structures as functional components depends on knowledge of their assembly behavior and mechanical properties. This paper demonstrates construction of higher order DNA origami arrays with controlled orientations based on the formation of two types of DNA junctions: anti-parallel and parallel double crossovers. A two-step assembly process, in which preformed rectangular DNA origami monomer structures themselves undergo further self-assembly to form numerically unlimited arrays, was investigated to reveal the influences of assembly parameters. AFM observations showed that when parallel double crossover DNA junctions are used, the assembly of DNA origami arrays occurs with fewer monomers than for structures formed using anti-parallel double crossovers, given the same assembly parameters, indicating that the configuration of parallel double crossovers is not energetically preferred. However, the direct measurement by AFM force-controlled mapping shows that both DNA junctions of anti-parallel and parallel double crossovers have homogeneous mechanical stability with any part of DNA origami.

  17. Mobile and replicated alignment of arrays in data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  18. Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems

    NASA Technical Reports Server (NTRS)

    Das, T. P.

    1988-01-01

    The usefulness of the Massively Parallel Processor (MPP) for investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedure being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of investigations and results are reported, including MPP adaptation of self-consistent charge extended Hueckel (SCCEH) procedure, MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and MPP adaptation of the many-body procedure for atomic systems.

  19. Parallel Processing of Large Scale Microphone Arrays for Sound Capture

    NASA Astrophysics Data System (ADS)

    Jan, Ea-Ee.

    1995-01-01

    Performance of microphone sound pickup is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Moreover, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be
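The delay-and-sum baseline that the matched-filter arrays outperform can be sketched as follows, idealizing the steering delays to whole samples (names are illustrative):

```python
def delay_and_sum(signals, delays):
    # Classic delay-and-sum beamformer: advance each sensor signal by its
    # steering delay (in samples) so arrivals from the look direction align,
    # then average across sensors. Matched-filter arrays generalize this
    # single-tap alignment to full room impulse responses.
    n = min(len(s) - d for s, d in zip(signals, delays))
    return [sum(s[d + t] for s, d in zip(signals, delays)) / len(signals)
            for t in range(n)]
```

Signals from the steered direction add coherently while off-axis arrivals and reverberant paths average down, which is the directional selectivity the abstract refers to.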

  20. Highly scalable parallel processing of extracellular recordings of Multielectrode Arrays.

    PubMed

    Gehring, Tiago V; Vasilaki, Eleni; Giugliano, Michele

    2015-01-01

    Technological advances in Multielectrode Arrays (MEAs), used for multisite, parallel electrophysiological recordings, lead to an ever-increasing amount of raw data being generated. Arrays with hundreds to a few thousand electrodes are slowly seeing widespread use, and the expectation is that more sophisticated arrays will become available in the near future. In order to process the large data volumes resulting from MEA recordings, there is a pressing need for new software tools able to process many data channels in parallel. Here we present a new tool for processing MEA data recordings that makes use of new programming paradigms and recent technology developments to unleash the power of modern highly parallel hardware, such as multi-core CPUs with vector instruction sets or GPGPUs. Our tool builds on and complements existing MEA data analysis packages. It shows high scalability and can be used to speed up some performance-critical pre-processing steps such as data filtering and spike detection, helping to make the analysis of larger data sets tractable. PMID:26737215
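
The per-channel pre-processing that such a tool parallelizes can be illustrated with a minimal sketch. The MAD-based threshold rule and the thread pool below are illustrative stand-ins, not the package's actual pipeline:

```python
from multiprocessing.pool import ThreadPool
import statistics

def detect_spikes(channel, k=5.0):
    """Flag samples whose magnitude exceeds k times a robust noise
    estimate (median absolute deviation); returns sample indices."""
    med = statistics.median(channel)
    mad = statistics.median([abs(x - med) for x in channel]) / 0.6745
    thresh = k * mad if mad > 0 else float("inf")
    return [i for i, x in enumerate(channel) if abs(x - med) > thresh]

def detect_all(channels, workers=4):
    # Channels are independent, so they can be processed in parallel;
    # a thread pool stands in here for the CPU/GPU parallelism.
    with ThreadPool(workers) as pool:
        return pool.map(detect_spikes, channels)
```

Because channels are independent, the map over channels is embarrassingly parallel, which is exactly the structure that multi-core and GPGPU back ends exploit.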

  2. Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor

    NASA Technical Reports Server (NTRS)

    Moore, J. Strother

    1992-01-01

    Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease, et al work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.

  3. Microchannel cross load array with dense parallel input

    DOEpatents

    Swierkowski, Stefan P.

    2004-04-06

    An architecture, or layout, for microchannel arrays using T or cross (+) loading for electrophoresis or other injection and separation chemistry performed in microfluidic configurations. This architecture enables a very dense layout of arrays of functionally identical shaped channels, and it also solves the problem of simultaneously enabling efficient parallel shaping and biasing of the input wells, waste wells, and bias wells at the input end of the separation columns. One T-load architecture uses circular holes with common rows, but not columns, which allows the flow paths for each channel to be identical in shape, using multiple mirror-image pieces. Another T-load architecture enables the access-hole array to be formed on a biaxial, collinear grid suitable for EDM micromachining (square holes), with common rows and columns.

  4. Parallel Syntheses of Peptides on Teflon-Patterned Paper Arrays (SyntArrays).

    PubMed

    Deiss, Frédérique; Yang, Yang; Derda, Ratmir

    2016-01-01

    Screening of peptides to find the ligands that bind to specific targets is an important step in drug discovery. These high-throughput screens require a large number of structural variants of peptides to be synthesized and tested. This chapter describes the generation of arrays of peptides on Teflon-patterned sheets of paper. First, the protocol describes the patterning of paper with a Teflon solution to produce arrays with solvophobic barriers that are able to confine organic solvents. Next, we describe the parallel syntheses of 96 peptides on Teflon-patterned arrays using the SPOT synthesis method. PMID:26614081

  6. Medical ultrasound digital beamforming on a massively parallel processing array platform

    NASA Astrophysics Data System (ADS)

    Chen, Paul; Butts, Mike; Budlong, Brad

    2008-03-01

    Digital beamforming has been widely used in modern medical ultrasound instruments. Flexibility is the key advantage of a digital beamformer over the traditional analog approach. Unlike analog delay lines, digital delays can be programmed to implement new ways of beam shaping and beam steering without hardware modification. Digital beamformers can also be focused dynamically by tracking the depth and focusing the receive beam as the depth increases. By constantly updating an element weight table, a digital beamformer can dynamically increase aperture size with depth to maintain constant lateral resolution and reduce sidelobe noise. Because ultrasound digital beamformers have high I/O bandwidth and processing requirements, traditionally they have been implemented using ASICs or FPGAs that are costly both in time and in money. This paper introduces a sample implementation of a digital beamformer that is programmed in software on a Massively Parallel Processor Array (MPPA). The system consists of a host PC and a PCI Express-based beamformer accelerator with an Ambric Am2045 MPPA chip and 512 Mbytes of external memory. The Am2045 has 336 asynchronous RISC-DSP processors that communicate through a configurable structure of channels, using a self-synchronizing communication protocol.
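
The dynamic receive focusing described above reduces, per element, to computing depth-dependent delays that flatten as the focus deepens. A minimal sketch (illustrative geometry and names, not the Am2045 firmware):

```python
import math

def receive_delays(element_xs, focus_depth, fs, c=1540.0):
    """Per-element receive focusing delays (in samples) for a linear
    array focused on its axis at focus_depth meters.
    element_xs: lateral element positions in meters; fs in Hz;
    c: assumed speed of sound in tissue (m/s)."""
    # path difference relative to the center of the aperture
    paths = [math.sqrt(x * x + focus_depth ** 2) for x in element_xs]
    return [round((p - focus_depth) / c * fs) for p in paths]
```

As depth increases, the delay profile computed this way flattens, which is why the beamformer must update it continuously while tracking the receive focus.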

  7. Scalable Unix commands for parallel processors : a high-performance implementation.

    SciTech Connect

    Ong, E.; Lusk, E.; Gropp, W.

    2001-06-22

    We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
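
The pattern behind such commands is to run the same operation for every node and merge the labeled results. A hypothetical local stand-in (a thread pool imitating MPI ranks, with every "host" running the command locally) looks like:

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def parallel_cmd(hosts, argv):
    """Run the same command for every host and collect labeled output.
    Illustrative only: real Parallel Unix Commands use MPI to run on
    the remote nodes; here each 'host' just runs argv locally."""
    def run(host):
        out = subprocess.run(argv, capture_output=True, text=True)
        return host, out.stdout.strip()
    with ThreadPoolExecutor(max_workers=len(hosts)) as ex:
        return dict(ex.map(run, hosts))
```

The merge step (here a simple dict) is where the real implementation gains performance, e.g., by combining per-node `ls` results with a scalable MPI reduction instead of gathering raw text.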

  8. Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986

    SciTech Connect

    Golub, G.H.

    1987-04-30

    The basic objective of this project was to consider a large class of matrix computations with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications such as the solution of partial differential equations by multigrid methods and in the fitting of geodetic data. Some of the methods developed have already found their use on some of the newly developed architectures.

  9. Breast ultrasound tomography with two parallel transducer arrays

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Gao, Kai; Intrator, Miranda; Hanson, Kenneth

    2016-03-01

    Breast ultrasound tomography is an emerging imaging modality that reconstructs the sound speed, density, and ultrasound attenuation of the breast, in addition to ultrasound reflection/beamforming images, for breast cancer detection and characterization. We recently designed and manufactured a new synthetic-aperture breast ultrasound tomography prototype with two parallel transducer arrays consisting of a total of 768 transducer elements. The transducer arrays are translated vertically to scan the breast in a warm water tank from the chest wall/axillary region to the nipple region, acquiring ultrasound transmission and reflection data for whole-breast ultrasound tomography imaging. The distance between the two ultrasound transducer arrays is adjustable for scanning breasts of different sizes. We use our breast ultrasound tomography prototype to acquire phantom and in vivo patient ultrasound data to study its feasibility for breast imaging. We apply our recently developed ultrasound imaging and tomography algorithms to the ultrasound data acquired with this system. Our in vivo patient imaging results demonstrate that our breast ultrasound tomography can detect breast lesions shown on clinical ultrasound and mammographic images.

  10. QLISP for parallel processors. Final report, 15 July 1986-31 July 1988

    SciTech Connect

    McCarthy, J.

    1989-01-01

    The goal of the QLISP project at Stanford is to gain experience with the shared-memory, queue-based approach to parallel Lisp by implementing the QLISP language on an actual multiprocessor and by developing a symbolic algebra system as a testbed application. The experiments performed on the simulator included: (1) algorithms for sorting and basic data-structure manipulation for polynomials; (2) partitioning and scheduling methods for parallel programming; and (3) parallelizing the production-rule system OPS5.

  11. O(1) time algorithms for computing histogram and Hough transform on a cross-bridge reconfigurable array of processors

    SciTech Connect

    Kao, T.; Horng, S.; Wang, Y.

    1995-04-01

    Instead of using the base-2 number system, we use a base-m number system to represent the numbers in the proposed algorithms. Such a strategy can be used to design an O(T)-time, T = log_m N + 1, prefix-sum algorithm for an N-bit binary sequence on a cross-bridge reconfigurable array of processors using N processors, where the data bus is m bits wide. This basic operation can then be used to compute the histogram of an n x n image with G gray-level values in constant time using G x n x n processors, and to compute the Hough transform of an image with N edge pixels and an n x n parameter space in constant time using n x n x N processors. These results are better than those previously reported in the literature. Also, the execution time of the proposed algorithms is tunable by the bus bandwidth. 43 refs.
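
Serial reference versions of the two quantities involved clarify what the array computes; the O(1) and O(log_m N) parallel algorithms themselves depend on the reconfigurable bus and are not reproduced here:

```python
def prefix_sums(bits):
    """Prefix sums of a binary sequence: the paper's primitive,
    computed on the array in O(log_m N) steps but written here as a
    plain loop for checking results."""
    out, acc = [], 0
    for b in bits:
        acc += b
        out.append(acc)
    return out

def histogram(image, levels):
    """Gray-level histogram of an image: the quantity the array
    computes in constant time with G x n x n processors."""
    counts = [0] * levels
    for row in image:
        for px in row:
            counts[px] += 1
    return counts
```

On the processor array, the histogram reduces to G parallel prefix-sum problems, one per gray level, which is why the prefix-sum primitive is the key building block.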

  12. A longitudinal multi-bunch feedback system using parallel digital signal processors

    SciTech Connect

    Sapozhnikov, L.; Fox, J.D.; Olsen, J.J.; Oxoby, G.; Linscott, I.; Drago, A.; Serio, M.

    1993-12-01

    A programmable longitudinal feedback system based on four AT&T 1610 digital signal processors has been developed as a component of the PEP-II R&D program. This longitudinal quick prototype is a proof of concept for the PEP-II system and implements full-speed bunch-by-bunch signal processing for storage rings with bunch spacing of 4 ns. The design incorporates a phase-detector-based front end that digitizes the oscillation phases of bunches at the 250 MHz crossing rate, four programmable signal processors that compute correction signals, and a 250-MHz hold buffer/kicker driver stage that applies correction signals back on the beam. The design implements a general-purpose, table-driven downsampler that allows the system to be operated at several accelerator facilities. The hardware architecture of the signal processing is described, and the software algorithms used in the feedback signal computation are discussed. The system configuration used for tests at the LBL Advanced Light Source is presented.

  13. Microfluidic trap array for massively parallel imaging of Drosophila embryos.

    PubMed

    Levario, Thomas J; Zhan, Mei; Lim, Bomyi; Shvartsman, Stanislav Y; Lu, Hang

    2013-04-01

    Here we describe a protocol for the fabrication and use of a microfluidic device to rapidly orient >700 Drosophila embryos in parallel for end-on imaging. The protocol describes master microfabrication (∼1 d), polydimethylsiloxane molding (few hours), system setup and device operation (few minutes) and imaging (depending on application). Our microfluidics-based approach described here is one of the first to facilitate rapid orientation for end-on imaging, and it is a major breakthrough for quantitative studies on Drosophila embryogenesis. The operating principle of the embryo trap is based on passive hydrodynamics, and it does not require direct manipulation of embryos by the user; biologists following the protocol should be able to repeat these procedures. The compact design and fabrication materials used allow the device to be used with traditional microscopy setups and do not require specialized fixtures. Furthermore, with slight modification, this array can be applied to the handling of other model organisms and oblong objects. PMID:23493069

  14. Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.

    SciTech Connect

    Learn, Mark Walter

    2012-01-01

    The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

  15. Fast String Search on Multicore Processors: Mapping fundamental algorithms onto parallel hardware

    SciTech Connect

    Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

    2008-04-01

    String searching is one of the most basic algorithms in computing. It has a host of applications, including search engines, network intrusion detection, virus scanners, spam filters, and DNA analysis, among others. The Cell processor, with its multiple cores, promises significant speed-ups for string searching. In this article, we show how we mapped string searching efficiently onto the Cell. We present two implementations: (1) a fast implementation that supports a small dictionary size (approximately 100 patterns) and provides a throughput of 40 Gbps, which is 100 times faster than reference implementations on x86 architectures; and (2) a heavy-duty implementation that is slower (3.3-4.3 Gbps) but supports dictionaries with tens of thousands of strings.
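
The kernel being accelerated is multi-pattern search, and the standard way to parallelize it is to split the text across cores with an overlap of the maximum pattern length minus one so that boundary-straddling matches are not lost. A naive illustrative sketch (not the Cell SPE code):

```python
def find_patterns(text, patterns):
    """Serial reference: all (position, pattern) matches, sorted."""
    hits = []
    for pat in patterns:
        start = 0
        while True:
            i = text.find(pat, start)
            if i < 0:
                break
            hits.append((i, pat))
            start = i + 1
    return sorted(hits)

def chunked_search(text, patterns, chunks=4):
    """Split the text into chunks with overlap, search each chunk
    independently (the parallelizable part), and dedupe results."""
    overlap = max(map(len, patterns)) - 1
    size = max(1, -(-len(text) // chunks))  # ceil division
    hits = set()
    for c in range(chunks):
        lo = c * size
        hi = min(len(text), lo + size + overlap)
        for i, pat in find_patterns(text[lo:hi], patterns):
            hits.add((lo + i, pat))
    return sorted(hits)
```

Production implementations replace the per-pattern scan with an automaton such as Aho-Corasick so the cost is independent of dictionary size, but the chunk-with-overlap decomposition is the same.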

  16. Low-power, real-time digital video stabilization using the HyperX parallel processor

    NASA Astrophysics Data System (ADS)

    Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

    2011-06-01

    Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air/ground vehicles that focuses on significantly reducing the size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full-color video stream with a 53 ms frame latency using a single 100-DSP-core HyperX hx3100 processor running at less than 3 W power draw. By comparison, an Intel Core2 Duo processor running the same base algorithm on a 320 x 240, 15 fps stream consumes on the order of 18 W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP-based platform. In addition, the implementation only requires a minimal number of components to interface directly to the imaging sensor and helmet-mounted display, or the same computing architecture can be used to generate software-defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature-based algorithm (11 x 11 Difference-of-Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames, and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera/platform motion on a frame-by-frame basis.

  17. Multimode power processor

    DOEpatents

    O'Sullivan, G.A.; O'Sullivan, J.A.

    1999-07-27

    In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

  19. Fast 2D DOA Estimation Algorithm by an Array Manifold Matching Method with Parallel Linear Arrays

    PubMed Central

    Yang, Lisheng; Liu, Sheng; Li, Dong; Jiang, Qingping; Cao, Hailin

    2016-01-01

    In this paper, the problem of two-dimensional (2D) direction-of-arrival (DOA) estimation with parallel linear arrays is addressed. Two array manifold matching (AMM) approaches, in this work, are developed for the incoherent and coherent signals, respectively. The proposed AMM methods estimate the azimuth angle only with the assumption that the elevation angles are known or estimated. The proposed methods are time efficient since they do not require eigenvalue decomposition (EVD) or peak searching. In addition, the complexity analysis shows the proposed AMM approaches have lower computational complexity than many current state-of-the-art algorithms. The estimated azimuth angles produced by the AMM approaches are automatically paired with the elevation angles. More importantly, for estimating the azimuth angles of coherent signals, the aperture loss issue is avoided since a decorrelation procedure is not required for the proposed AMM method. Numerical studies demonstrate the effectiveness of the proposed approaches. PMID:26907301

  1. Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

    SciTech Connect

    Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

    2005-09-20

    At Jefferson Lab, we have been evaluating soft-core processors running an EPICS IOC over μClinux on our custom hardware. A soft-core processor is a flexible CPU architecture that is configured in the FPGA, as opposed to a hard-core processor, which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general-purpose computer IOC, the designer is no longer tied to a specific platform, e.g., PC, VME, or VXI, to serve as the intermediary between the high-level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next-generation low-level RF controls and Machine Protection Systems.

  2. Representing S-expressions for the efficient evaluation of Lisp on parallel processors

    SciTech Connect

    Harrison, W.L. III; Padua, D.A.

    1986-03-01

    Present methods for exploiting parallelism in Lisp programs perform poorly upon lists (long, flat s-expressions), as such structures must be both created and traversed sequentially. While such a serial operation may be masked by overlapping it with other computation (by virtue of process spawning, or by the use of a mechanism such as futures), it represents a lost (and potentially large) source of parallelism. In this paper we describe the representation of s-expressions employed in PARCEL (Project for the Automatic Restructuring and Concurrent Evaluation of Lisp), which facilitates the creation and access of lists, without compromising the performance of functions which manipulate s-expressions of a more general shape. Using this representation, the PARCEL compiler translates Lisp programs written in a subset of the Scheme dialect (which allows for global variables and atom properties) into code for a large, tightly coupled shared memory multiprocessor. 12 refs.

  3. Block-Level Added Redundancy Explicit Authentication for Parallelized Encryption and Integrity Checking of Processor-Memory Transactions

    NASA Astrophysics Data System (ADS)

    Elbaz, Reouven; Torres, Lionel; Sassatelli, Gilles; Guillemin, Pierre; Bardouillet, Michel; Martinez, Albert

    The bus between the System on Chip (SoC) and the external memory is one of the weakest points of computer systems: an adversary can easily probe this bus in order to read private data (data confidentiality concern) or to inject data (data integrity concern). The conventional way to protect data against such attacks and to ensure data confidentiality and integrity is to implement two dedicated engines: one performing data encryption and another data authentication. This approach, while secure, prevents parallelizability of the underlying computations. In this paper, we introduce the concept of Block-Level Added Redundancy Explicit Authentication (BL-AREA) and we describe a Parallelized Encryption and Integrity Checking Engine (PE-ICE) based on this concept. BL-AREA and PE-ICE have been designed to provide an effective solution to ensure both security services while allowing for full parallelization on processor read and write operations and optimizing the hardware resources. Compared to standard encryption which ensures only confidentiality, we show that PE-ICE additionally guarantees code and data integrity for less than 4% of run-time performance overhead.

  4. Trajectory optimization for real-time guidance. I - Time-varying LQR on a parallel processor

    NASA Technical Reports Server (NTRS)

    Psiaki, Mark L.; Park, Kihong

    1990-01-01

    A key algorithmic element of a real-time trajectory optimization hardware/software implementation, the quadratic program (QP) solver element, is presented. The purpose of the effort is to make nonlinear trajectory optimization fast enough to provide real-time commands during guidance of a vehicle such as an aeromaneuvering orbiter. Many methods of nonlinear programming require the solution of a QP at each iteration. In the trajectory optimization case, the QP has a special dynamic programming structure, an LQR-like structure. QP algorithm speed is increased by taking advantage of this special structure and by parallel implementation.
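
The LQR-like structure means the QP can be solved by a backward Riccati recursion rather than a generic dense solver. A scalar time-varying sketch (illustrative only, not the paper's parallel implementation):

```python
def tv_lqr_gains(A, B, Q, R, Qf):
    """Backward Riccati recursion for a scalar time-varying LQR.
    A, B, Q, R: per-step lists of dynamics and cost coefficients;
    Qf: terminal cost. Returns the feedback gains K[t]."""
    N = len(A)
    P = Qf  # cost-to-go coefficient, swept backward from the end
    K = [0.0] * N
    for t in range(N - 1, -1, -1):
        K[t] = (B[t] * P * A[t]) / (R[t] + B[t] * P * B[t])
        Acl = A[t] - B[t] * K[t]  # closed-loop dynamics
        P = Q[t] + K[t] * R[t] * K[t] + Acl * P * Acl
    return K
```

For the time-invariant case A = B = Q = R = 1, the gain converges to 1/φ ≈ 0.618, the scalar discrete-time LQR fixed point; the dynamic-programming sweep is what the special QP structure exploits.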

  5. Optimizing ion channel models using a parallel genetic algorithm on graphical processors.

    PubMed

    Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon

    2012-01-01

    We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high-performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm for work on a graphical processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80-node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (ordinary differential equations) so as to massively reduce memory transfers to and from the GPU. This approach may be applied to speed up other data-intensive applications requiring iterative solutions of ODEs.
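
The serial skeleton of such a genetic search is easy to state; in the paper the expensive step, evaluating the fitness of every individual (a full voltage-clamp simulation), is what runs in parallel on the GPU. The operators and parameters below are illustrative choices, not the authors' algorithm:

```python
import random

def genetic_minimize(fitness, bounds, pop=40, gens=60, seed=1):
    """Minimal real-coded genetic algorithm: elitist selection,
    averaging crossover, single-coordinate Gaussian mutation.
    bounds: list of (lo, hi) per parameter. Returns the best found."""
    rng = random.Random(seed)
    def rand_ind():
        return [rng.uniform(lo, hi) for lo, hi in bounds]
    popn = [rand_ind() for _ in range(pop)]
    for _ in range(gens):
        # In the GPU version, this fitness evaluation of the whole
        # population is the step executed in parallel.
        elite = sorted(popn, key=fitness)[: pop // 4]
        popn = elite[:]
        while len(popn) < pop:
            a, b = rng.sample(elite, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]  # crossover
            j = rng.randrange(len(child))                # mutation
            lo, hi = bounds[j]
            child[j] = min(hi, max(lo, child[j] + rng.gauss(0, 0.1 * (hi - lo))))
            popn.append(child)
    return min(popn, key=fitness)
```

Because individuals are evaluated independently within a generation, population-based searches of this kind map naturally onto massively parallel hardware.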

  6. Multimedia OC12 parallel interface using VCSEL array to achieve high-performance cost-effective optical interconnections

    NASA Astrophysics Data System (ADS)

    Chang, Edward S.

    1996-09-01

    Multimedia communication needs high-performance, cost-effective communication techniques to transport data for the fast-growing multimedia traffic resulting from the recent deployment of the World Wide Web (WWW), media-on-demand, and other multimedia applications. To transport a large volume of multimedia data, high-performance servers are required to perform media processing and transfer. Typically, the high-performance multimedia server is a massively parallel processor with a high number of I/O ports, high storage capacity, fast signal processing, and excellent cost-performance. The parallel I/O ports of the server are connected to multiple clients through a network switch which uses parallel links in both switch-to-server and switch-to-client connections. In addition to media processing and storage, media communication is also a major function of the multimedia system. Without a high-performance communication network, a high-performance server cannot deliver its full capacity of service to clients. Fortunately, there are many advanced communication technologies developed for networking, which can be adopted by multimedia communication to economically deliver the full capacity of a high-performance multimedia service to clients. The VCSEL array technology has been developed for gigabit-rate parallel optical interconnections because of its high-bandwidth, small-size, and easy-fabrication advantages. Several firms are developing multifiber, low-skew, low-cost ribbon cables to transfer signals from a VCSEL array. The OC12 SONET data rate is widely used by high-performance multimedia communications for its high data rate and cost-effectiveness. Therefore, the OC12 VCSEL parallel optical interconnection is the ideal technology to meet the high-performance, low-cost requirements for delivering affordable multimedia services to mass users. This paper describes a multimedia OC12 parallel optical interconnection using a VCSEL array transceiver, a multifiber

  7. Preparation of ZnO/ZnSe heterostructure parallel arrays for photodetector application

    NASA Astrophysics Data System (ADS)

    Xiao, Chuanhai; Wang, Yuda; Yang, Tianye; Luo, Yang; Zhang, Mingzhe

    2016-07-01

    ZnO/ZnSe heterostructure parallel arrays on a glass substrate were prepared by an ultrathin-layer electrodeposition method combined with annealing treatment. Two factors are essential for the formation of such parallel arrays: the periodic change of charge and ion concentration, and the mutual equilibrium of electric repulsion at the growth front. Investigation of the photoresponse characteristics of the heterostructure arrays demonstrates a broad UV/visible spectral response.

  8. Method of up-front load balancing for local memory parallel processors

    NASA Technical Reports Server (NTRS)

    Baffes, Paul Thomas (Inventor)

    1990-01-01

    In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balanced load. Said merger is based upon the value of a partition threshold, which is a measure of memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy-five percent.
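
    The patent's exact merging procedure is driven by a partition threshold it does not fully spell out here; as a rough illustration of reducing many process sets onto few processors while keeping the load balanced, a longest-processing-time (LPT) greedy merge can be sketched (hypothetical code, not the patented method):

```python
import heapq

def balance(process_sets, num_procs):
    """Greedily merge process-set loads onto num_procs processors (LPT heuristic).

    process_sets: list of computational loads, one per artificial process set.
    Returns the merged load carried by each physical processor.
    """
    # Pair each processor with its running load; the heap keeps the
    # least-loaded processor on top.
    heap = [(0.0, i) for i in range(num_procs)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_procs)]
    # Largest process sets first, each onto the currently least-loaded processor.
    for load in sorted(process_sets, reverse=True):
        total, i = heapq.heappop(heap)
        assignment[i].append(load)
        heapq.heappush(heap, (total + load, i))
    return [sum(a) for a in assignment]
```

For seven process sets with loads 5, 4, 3, 3, 2, 2, 1 merged onto two processors, this yields a perfectly balanced 10/10 split.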

  9. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described, with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices of special and general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products on such architectures, are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed, with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed; these represent the fundamental operations necessary in the implementation of least-squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
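
    As a concrete reminder of the recurrence such LU hardware pipelines, here is a plain sequential Doolittle factorization (no pivoting; an illustrative baseline, not the optical implementation):

```python
import numpy as np

def lu_doolittle(A):
    """Doolittle LU factorization without pivoting: L is unit lower
    triangular, U is upper triangular, and A = L @ U. Each step k is the
    kind of update wave a systolic array pipelines across its cells."""
    n = len(A)
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        # Row k of U from already-computed rows of L and U.
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]
        # Column k of L, scaled by the new pivot U[k, k].
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]
    return L, U
```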

  10. Performance evaluation of the JPL interim digital SAR processor

    NASA Technical Reports Server (NTRS)

    Wu, C.; Barkan, B.; Curlander, J.; Jin, M.; Pang, S.

    1983-01-01

    The performance of the Interim Digital SAR Processor (IDP) was evaluated. The IDP processor was originally developed for experimental processing of digital SEASAT SAR data. One phase of the system upgrade which features parallel processing in three peripheral array processors, automated estimation for Doppler parameters, and unsupervised image pixel location determination and registration was executed. The method to compensate for the target range curvature effect was improved. A four point interpolation scheme is implemented to replace the nearest neighbor scheme used in the original IDP. The processor still maintains its fast throughput speed. The current performance and capability of the processing modes now available on the IDP system are updated.

  11. Compute Processor Allocator

    2004-03-01

    The Compute Processor Allocator (CPA) provides an efficient and reliable mechanism for managing and allotting processors in a massively parallel (MP) computer. It maintains information in a database on the health, configuration, and allocation of each processor. This persistent information is factored into each allocation decision. The CPA runs in a distributed fashion to avoid a single point of failure.
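
    The CPA's actual database schema and allocation policy are not given in this record; a toy in-memory sketch of the idea (per-processor health, configuration, and allocation records consulted on every decision; all names hypothetical) might look like:

```python
class ProcessorAllocator:
    """Toy sketch in the spirit of the CPA: persistent per-processor state
    drives allocation decisions. Not the actual Sandia implementation."""

    def __init__(self, num_procs):
        # One record per processor: health flag and owning job (None = free).
        self.state = {p: {"healthy": True, "job": None} for p in range(num_procs)}

    def mark_down(self, proc):
        """Record a processor as unhealthy; it is skipped by future allocations."""
        self.state[proc]["healthy"] = False

    def allocate(self, job_id, count):
        """Give `count` healthy, free processors to `job_id`, or None if short."""
        free = [p for p, s in self.state.items() if s["healthy"] and s["job"] is None]
        if len(free) < count:
            return None
        chosen = free[:count]
        for p in chosen:
            self.state[p]["job"] = job_id
        return chosen

    def release(self, job_id):
        """Return all of a job's processors to the free pool."""
        for s in self.state.values():
            if s["job"] == job_id:
                s["job"] = None
```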

  12. Massively parallel computation of lattice associative memory classifiers on multicore processors

    NASA Astrophysics Data System (ADS)

    Ritter, Gerhard X.; Schmalz, Mark S.; Hayden, Eric T.

    2011-09-01

    Over the past quarter century, concepts and theory derived from neural networks (NNs) have featured prominently in the literature of pattern recognition. Implementationally, classical NNs based on the linear inner product can present performance challenges due to the use of multiplication operations. In contrast, NNs having nonlinear kernels based on Lattice Associative Memories (LAM) theory tend to concentrate primarily on addition and maximum/minimum operations. More generally, the emergence of LAM-based NNs, with their superior information storage capacity, fast convergence and training due to relatively lower computational cost, as well as noise-tolerant classification has extended the capabilities of neural networks far beyond the limited applications potential of classical NNs. This paper explores theory and algorithmic approaches for the efficient computation of LAM-based neural networks, in particular lattice neural nets and dendritic lattice associative memories. Of particular interest are massively parallel architectures such as multicore CPUs and graphics processing units (GPUs). Originally developed for video gaming applications, GPUs hold the promise of high computational throughput without compromising numerical accuracy. Unfortunately, currently-available GPU architectures tend to have idiosyncratic memory hierarchies that can produce unacceptably high data movement latencies for relatively simple operations, unless careful design of theory and algorithms is employed. Advantageously, some GPUs (e.g., the Nvidia Fermi GPU) are optimized for efficient streaming computation (e.g., concurrent multiply and add operations). As a result, the linear or nonlinear inner product structures of NNs are inherently suited to multicore GPU computational capabilities. In this paper, the authors' recent research in lattice associative memories and their implementation on multicores is overviewed, with results that show utility for a wide variety of pattern

  13. Design and numerical evaluation of a volume coil array for parallel MR imaging at ultrahigh fields

    PubMed Central

    Pang, Yong; Wong, Ernest W.H.; Yu, Baiying

    2014-01-01

    In this work, we propose and investigate a volume coil array design method using different types of birdcage coils for MR imaging. Unlike conventional radiofrequency (RF) coil arrays, whose array elements are surface coils, the proposed volume coil array consists of a set of independent volume coils: a conventional birdcage coil, a transverse birdcage coil, and a helix birdcage coil. The magnetic fluxes of these three birdcage coils are intrinsically cancelled, yielding a highly decoupled volume coil array. In contrast to conventional non-array volume coils, the volume coil array is beneficial in improving the MR signal-to-noise ratio (SNR) and also gains the capability of parallel imaging. The volume coil array is evaluated at the ultrahigh field of 7T using FDTD numerical simulations, and g-factor maps at different acceleration rates are calculated to investigate its parallel imaging performance. PMID:24649435

  14. Graphics-processor-unit-based parallelization of optimized baseline wander filtering algorithms for long-term electrocardiography.

    PubMed

    Niederhauser, Thomas; Wyss-Balmer, Thomas; Haeberlin, Andreas; Marisa, Thanks; Wildhaber, Reto A; Goette, Josef; Jacomet, Marcel; Vogel, Rolf

    2015-06-01

    Long-term electrocardiograms (ECG) often suffer from relevant noise. Baseline wander in particular is pronounced in ECG recordings using dry or esophageal electrodes, which are dedicated to prolonged registration. While analog high-pass filters introduce phase distortions, reliable offline filtering of the baseline wander implies a computational burden that has to be put in relation to the increase in signal-to-baseline ratio (SBR). Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite and infinite impulse response, moving-mean, and moving-median filters. Individual filter parameters were optimized with respect to the SBR increase based on ECGs from the Physionet database superimposed on autoregressive-modeled, real baseline wander. A Monte Carlo simulation showed that for low input SBR the moving-median filter outperforms any other method but negatively affects ECG wave detection. In contrast, the infinite impulse response filter is preferred in case of high input SBR. However, the parallelized wavelet filter is processed 500 and 4 times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low-SBR situations. Using a signal segment of 64 megasamples filtered as an entire unit, wavelet filtering of a seven-day high-resolution ECG is computed within less than 3 s. Taking the high filtering speed into account, the GPU wavelet filter is the most efficient method to remove baseline wander present in long-term ECGs, and the computational burden can thereby be strongly reduced. PMID:25675449
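
    Of the filters compared, the moving-median filter is the simplest to sketch: the baseline estimate is a centered moving median, which is then subtracted from the signal. A minimal, unoptimized CPU version of that idea (window length and padding choice are assumptions, not the paper's tuned parameters):

```python
import numpy as np

def remove_baseline_median(ecg, window):
    """Subtract a centered moving-median baseline estimate from `ecg`.

    `window` should be odd and longer than the widest ECG wave so that
    QRS complexes do not leak into the baseline estimate.
    """
    half = window // 2
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(ecg, half, mode="edge")
    baseline = np.array([np.median(padded[i:i + window]) for i in range(len(ecg))])
    return ecg - baseline
```

On a pure slow drift (no ECG waves), the moving median tracks the drift exactly, so the filter output is essentially zero.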

  16. High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures

    NASA Technical Reports Server (NTRS)

    Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

    1985-01-01

    Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations that are useful for integrating array processor, multi-microprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput and can maintain constant interaction with the non-real-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system was designed as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can be applied to a wide range of automatic control applications.

  17. A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system

    NASA Technical Reports Server (NTRS)

    Olariu, S.; Schwing, J.; Zhang, J.

    1991-01-01

    A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n / log m) time on a 2-D PARBS of size mn x n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 x n.
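
    The sublogarithmic PARBS bound depends on the reconfigurable bus and cannot be reproduced in serial code; for reference, the problem being solved is the standard planar convex hull, shown here as the sequential O(n log n) monotone-chain algorithm:

```python
def convex_hull(points):
    """Andrew's monotone-chain convex hull: a sequential O(n log n)
    baseline for the planar problem the 2-D PARBS algorithm accelerates.
    Returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive if o->a->b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop points that would make the chain turn clockwise.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point repeats in the other half

    return half(pts) + half(list(reversed(pts)))  # lower hull + upper hull
```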

  18. Building and using a highly parallel programmable logic array

    SciTech Connect

    Gokhale, M.; Holmes, W.; Kopser, A.; Lucas, S.; Minnich, R.; Sweely, D. ); Lopresti, D. )

    1991-01-01

    With a $13,000 two-slot addition called Splash, a Sun workstation can outperform a Cray-2 on certain applications. Several applications, most involving bit-stream computations, have been run on Splash, which received a 1989 Gordon Bell Prize honorable mention for timings on a problem that compared a new DNA sequence against a library of sequences to find the closest match. In essence, Splash is a programmable linear logic array that can be configured to suit the problem at hand; it bridges the gap between the traditional fixed-function VLSI systolic array and the more versatile programmable array. As originally conceived, a systolic array is a collection of simple processing elements, along with a one- or two-dimensional nearest-neighbor communication pattern. The local nature of the communication gives the systolic array a high communications bandwidth, and the simple, fixed function gives a high packing density for VLSI implementation.

  19. Stream Processors

    NASA Astrophysics Data System (ADS)

    Erez, Mattan; Dally, William J.

    Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control, consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
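
    The gather-compute-scatter form can be caricatured in a few lines; the skeleton below (an illustration, not any real stream programming language) makes the explicit separation of data movement from computation visible:

```python
def stream_kernel(memory, gather_idx, scatter_idx, compute):
    """Gather-compute-scatter skeleton: bulk data movement is explicit
    and separated from the compute kernel, as in the stream model."""
    # Gather: bulk transfer from off-chip memory into an on-chip buffer.
    local = [memory[i] for i in gather_idx]
    # Compute: the kernel touches only local, on-chip data.
    results = [compute(x) for x in local]
    # Scatter: bulk transfer of results back to off-chip memory.
    for i, r in zip(scatter_idx, results):
        memory[i] = r
    return memory
```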

  20. NEUSORT2.0: a multiple-channel neural signal processor with systolic array buffer and channel-interleaving processing schedule.

    PubMed

    Chen, Tung-Chien; Yang, Zhi; Liu, Wentai; Chen, Liang-Gee

    2008-01-01

    An emerging class of neuroprosthetic devices aims to provide aggressive performance by integrating more complicated signal processing hardware into neural recording systems with large numbers of electrodes. However, the traditional parallel structure, which duplicates one neural signal processor (NSP) for each channel, imposes a heavy burden on chip area. The serial structure, which sequentially switches the processing task between channels, requires a bulky memory to store neural data and may have a long processing delay. In this paper, a memory hierarchy based on a systolic array buffer is proposed to support signal processing interleaved channel by channel on a cycle basis, matching the data flow of the optimized multiple-channel front-end interface circuitry. The NSP can thus be tightly coupled to the analog front-end interface circuitry and perform signal processing for multiple channels in real time without any bulky memory. Based on our previous one-channel NSP, NEUSORT1.0 [1], the proposed memory hierarchy is realized in NEUSORT2.0 for a 16-channel neural recording system. Compared to 16 instances of NEUSORT1.0, NEUSORT2.0 demonstrates an 81.50% saving in terms of the area × power factor.
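
    In its simplest form, the channel-interleaving idea is round-robin servicing of channels by the single shared NSP, one channel per cycle; a minimal sketch (ignoring the systolic buffer details):

```python
def interleaved_schedule(num_channels, num_cycles):
    """Round-robin channel-interleaved schedule: one shared processor
    services the next channel each cycle, so no per-channel NSP copy
    (and no bulk per-channel sample memory) is needed."""
    return [cycle % num_channels for cycle in range(num_cycles)]
```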

  1. A fast track trigger processor for the OPAL experiment at LEP, CERN

    SciTech Connect

    Bramhall, M.; Jaroslawski, S.; Penton, A.; Hammarstrom, R.; Joos, D.; Weber, C.

    1989-02-01

    A fast programmable trigger processor for the OPAL experiment is described. The processor can handle multihit events. The tracks are found in the R-Z and the R-PHI planes by 24 fast track finder circuits operating in parallel using a novel histogramming technique. A semicustom coincidence array circuit is used to match tracks.

  2. Electrostatic quadrupole array for focusing parallel beams of charged particles

    DOEpatents

    Brodowski, John

    1982-11-23

    An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

  3. Research of control system stability in solar array simulator with continuous power amplifier of parallel type

    NASA Astrophysics Data System (ADS)

    Mizrah, E. A.; Tkachev, S. B.; Shtabel, N. V.

    2015-10-01

    Solar array simulators are nonlinear control systems designed to reproduce the static and dynamic characteristics of a solar array. Solar array characteristics depend on illumination, temperature, the space environment, and other factors. During ground testing of spacecraft power systems, there is a problem in achieving stable operation of the simulator with different impedance loads over a wide load-regulation range. In this article, the authors propose a research method for absolute process stability in solar array simulators and present results of absolute stability research for a solar array simulator with a continuous parallel-type power amplifier.

  4. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
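
    The patent describes, in effect, analog vector quantization: each input vector is compared against N codebook vectors simultaneously, and the index of the closest is output. A digital sketch of that comparison (NumPy broadcasting standing in for the parallel analog cell array):

```python
import numpy as np

def vq_encode(inputs, codebook):
    """Vector quantization in the spirit of the N x M cell array:
    every input vector is compared against all N codebook vectors at
    once, and the index of the closest codebook vector is emitted.

    inputs: (num_vectors, M) array; codebook: (N, M) array.
    """
    # Squared Euclidean distance for every (input, codebook) pair in one
    # shot -- the comparison the analog array performs in parallel.
    d = ((inputs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Closeness determination: one codebook index per input vector.
    return d.argmin(axis=1)
```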

  5. Experience in highly parallel processing using DAP

    NASA Technical Reports Server (NTRS)

    Parkinson, D.

    1987-01-01

    Distributed Array Processors (DAPs) have been in day-to-day use for ten years, and a large amount of user experience has been gained. The profile of user applications is similar to that of the Massively Parallel Processor (MPP) working group. Experience has shown that, contrary to expectations, highly parallel systems provide excellent performance on so-called dirty problems such as the physics part of meteorological codes. The reasons for this observation are discussed, as are the arguments against replacing bit processors with floating-point processors.

  6. A frequency and sensitivity tunable microresonator array for high-speed quantum processor readout

    NASA Astrophysics Data System (ADS)

    Whittaker, J. D.; Swenson, L. J.; Volkmann, M. H.; Spear, P.; Altomare, F.; Berkley, A. J.; Bumble, B.; Bunyk, P.; Day, P. K.; Eom, B. H.; Harris, R.; Hilton, J. P.; Hoskinson, E.; Johnson, M. W.; Kleinsasser, A.; Ladizinsky, E.; Lanting, T.; Oh, T.; Perminov, I.; Tolkacheva, E.; Yao, J.

    2016-01-01

    Superconducting microresonators have been successfully utilized as detection elements for a wide variety of applications. With multiplexing factors exceeding 1000 detectors per transmission line, they are the most scalable low-temperature detector technology demonstrated to date. For high-throughput applications, fewer detectors can be coupled to a single wire but utilize a larger per-detector bandwidth. For all existing designs, fluctuations in fabrication tolerances result in a non-uniform shift in resonance frequency and sensitivity, which ultimately limits the efficiency of bandwidth utilization. Here, we present the design, implementation, and initial characterization of a superconducting microresonator readout integrating two tunable inductances per detector. We demonstrate that these tuning elements provide independent control of both the detector frequency and sensitivity, allowing us to maximize the transmission line bandwidth utilization. Finally, we discuss the integration of these detectors in a multilayer fabrication stack for high-speed readout of the D-Wave quantum processor, highlighting the use of control and routing circuitry composed of single-flux-quantum loops to minimize the number of control wires at the lowest temperature stage.

  7. Parallel array of independent thermostats for column separations

    DOEpatents

    Foret, Frantisek; Karger, Barry L.

    2005-08-16

    A thermostat array including an array of two or more capillary columns (10) or two or more channels in a microfabricated device is disclosed. A heat-conductive material (12) surrounds each individual column or channel in the array, each individual column or channel being thermally insulated from every other individual column or channel. One or more independently controlled heating or cooling elements (14) are positioned adjacent to individual columns or channels within the heat-conductive material, each heating or cooling element being connected to a source of heating or cooling, and one or more independently controlled temperature-sensing elements (16) are positioned adjacent to the individual columns or channels within the heat-conductive material. Each temperature-sensing element is connected to a temperature controller.

  8. Achieving supercomputer performance for neural net simulation with an array of digital signal processors

    SciTech Connect

    Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W.

    1992-10-01

    Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

  9. Frequency and sensitivity tunable microresonator array for high-speed quantum processor readout

    NASA Astrophysics Data System (ADS)

    Hoskinson, Emile; Whittaker, J. D.; Swenson, L. J.; Volkmann, M. H.; Spear, P.; Altomare, F.; Berkley, A. J.; Bumble, B.; Bunyk, P.; Day, P. K.; Eom, B. H.; Harris, R.; Hilton, J. P.; Johnson, M. W.; Kleinsasser, A.; Ladizinsky, E.; Lanting, T.; Oh, T.; Perminov, I.; Tolkacheva, E.; Yao, J.

    Frequency multiplexed arrays of superconducting microresonators have been used as detectors in a variety of applications. The degree of multiplexing achievable is limited by fabrication variation causing non-uniform shifts in resonator frequencies. We have designed, implemented and characterized a superconducting microresonator readout that incorporates two tunable inductances per detector, allowing independent control of each detector frequency and sensitivity. The tunable inductances are adjusted using on-chip programmable digital-to-analog flux converters, which are programmed with a scalable addressing scheme that requires few external lines.

  10. Design and fabrication of diffractive microlens arrays with continuous relief for parallel laser direct writing.

    PubMed

    Tan, Jiubin; Shan, Mingguang; Zhao, Chenguang; Liu, Jian

    2008-04-01

    Diffractive microlens arrays with continuous relief are designed, fabricated, and characterized by using Fermat's principle to create an array of spots on the photoresist-coated surface of a substrate for parallel laser direct writing. Experimental results indicate that a diffraction efficiency of 71.4% and a spot size of 1.97 microm (FWHM) can be achieved at normal incidence and a writing laser wavelength of 441.6 nm with an array of F/4 fabricated on fused silica, and the developed array can be used to improve the utilization ratio of writing laser energy. PMID:18382568

  11. Using a Cray Y-MP as an array processor for a RISC Workstation

    NASA Technical Reports Server (NTRS)

    Lamaster, Hugh; Rogallo, Sarah J.

    1992-01-01

    As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980s, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost-competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to perform in a Remote Procedure Call (RPC)-based environment; matrix multiplication is an example of an operation with enough arithmetic operations to amortize the cost of an RPC call. An experiment is described which demonstrates that matrix multiplication can be executed remotely on a large system to speed execution over that experienced on a workstation.
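
    The amortization argument can be made concrete with back-of-envelope arithmetic; the sketch below (all performance numbers in the test are made-up assumptions, not measurements from the paper) checks whether shipping an n x n multiply over RPC beats computing it locally:

```python
def remote_wins(n, local_flops, remote_flops, link_bytes_per_s, rpc_latency_s):
    """Back-of-envelope test of when an RPC'd n x n matrix multiply pays off.

    Compares local compute time against RPC latency + data transfer +
    remote compute time. All rates are assumptions supplied by the caller.
    """
    flops = 2.0 * n ** 3                 # multiply-adds in C = A @ B
    bytes_moved = 3 * n * n * 8          # send A and B, receive C, as float64
    t_local = flops / local_flops
    t_remote = rpc_latency_s + bytes_moved / link_bytes_per_s + flops / remote_flops
    return t_remote < t_local
```

With the cubic growth of arithmetic against only quadratic growth of data transferred, large matrices favor the remote machine while small ones are dominated by RPC overhead.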

  12. Sequence information signal processor

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1999-01-01

    An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
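
    The scoring recurrence each processor evaluates belongs to the Smith-Waterman local-alignment family; a sequential sketch of such a recurrence is below (the scoring values are illustrative, not the patented circuit's parameters). Each iteration of the outer loop corresponds to one processor in the linear array, holding its stored element of `a` while the elements of `b` stream past.

```python
def best_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman-style local alignment: returns the score of the
    best-scoring segment pair, the quantity the array's final
    closeness-determination stage selects."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:                        # one processor per element of `a`
        cur = [0]
        for j, cb in enumerate(b, 1):   # `b` streams through the array
            score = max(
                0,                                          # start a new segment
                prev[j - 1] + (match if ca == cb else mismatch),
                prev[j] + gap,                              # gap in b
                cur[j - 1] + gap,                           # gap in a
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```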

  13. Fully parallel write/read in resistive synaptic array for accelerating on-chip learning

    NASA Astrophysics Data System (ADS)

    Gao, Ligang; Wang, I.-Ting; Chen, Pai-Yu; Vrudhula, Sarma; Seo, Jae-sun; Cao, Yu; Hou, Tuo-Hung; Yu, Shimeng

    2015-11-01

    A neuro-inspired computing paradigm beyond the von Neumann architecture is emerging; it generally takes advantage of massive parallelism and is aimed at complex tasks that involve intelligence and learning. The cross-point array architecture with synaptic devices has been proposed for on-chip implementation of the weighted sum and weight update in learning algorithms. In this work, forming-free, silicon-process-compatible Ta/TaOx/TiO2/Ti synaptic devices are fabricated, in which >200 levels of conductance states can be continuously tuned by identical programming pulses. To demonstrate the advantages of the parallelism of the cross-point array architecture, a novel fully parallel write scheme is designed and experimentally demonstrated in a small-scale crossbar array to accelerate the weight update in the training process, at a speed that is independent of the array size. Compared to the conventional row-by-row write scheme, it achieves >30× speed-up and >30× improvement in energy efficiency as projected in a large-scale array. If realistic synaptic device characteristics such as device variations are taken into account in an array-level simulation, the proposed array architecture is able to achieve ∼95% recognition accuracy of MNIST handwritten digits, which is close to the accuracy achieved by software using the ideal sparse coding algorithm.
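
    The two array operations being accelerated are a weighted sum (the crossbar read) and a rank-1 outer-product weight update (what a fully parallel write applies in one step, regardless of array size). In matrix form (a software analogy, not the device physics):

```python
import numpy as np

def weighted_sum(W, x):
    """Crossbar read: each column current is the dot product of the input
    voltage vector with that column's conductances, i.e. W.T @ x in one step."""
    return W.T @ x

def parallel_update(W, pre, post, lr=0.1):
    """Fully parallel write: a rank-1 (outer-product) update applied to the
    whole conductance array at once, rather than row by row."""
    return W + lr * np.outer(pre, post)
```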

  14. Mitigation of cache memory using an embedded hard-core PPC440 processor in a Virtex-5 Field Programmable Gate Array.

    SciTech Connect

    Learn, Mark Walter

    2010-02-01

    Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.

  15. Parallel implementation of backpropagation neural networks on a heterogeneous array of transputers.

    PubMed

    Foo, S K; Saratchandran, P; Sundararajan, N

    1997-01-01

    This paper analyzes parallel implementation of the backpropagation training algorithm on a heterogeneous transputer network (i.e., transputers of different speed and memory) connected in a pipelined ring topology. Training-set parallelism is employed as the parallelizing paradigm for the backpropagation algorithm. It is shown through analysis that finding the optimal allocation of the training patterns amongst the processors to minimize the time for a training epoch is a mixed integer programming problem. Using mixed integer programming, optimal pattern allocations for heterogeneous processor networks having a mixture of T805-20 (20 MHz) and T805-25 (25 MHz) transputers are theoretically found for two benchmark problems. The time for an epoch corresponding to the optimal pattern allocations is then obtained experimentally for the benchmark problems from the T805-20, T805-25 heterogeneous networks. A Monte Carlo simulation study is carried out to statistically verify the optimality of the epoch time obtained from the mixed integer programming based allocations. In this study, pattern allocations are randomly generated and the corresponding time for an epoch is experimentally obtained from the heterogeneous network. The mean and standard deviation for the epoch times from the random allocations are then compared with the optimal epoch time. The results show the optimal epoch time to be always lower than the mean epoch times by more than three standard deviations (3σ) for all the sample sizes used in the study, thus validating the theoretical analysis.
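
The core of the allocation problem can be sketched simply: with training-set parallelism, the epoch time is the slowest processor's time for its share of patterns, so the optimizer minimizes that makespan. The sketch below uses a proportional heuristic rather than the paper's mixed integer program, and the per-pattern times are invented values, not the benchmark figures.

```python
# Epoch time under training-set parallelism: all processors work on their
# patterns concurrently, so the epoch lasts as long as the slowest one.
def epoch_time(alloc, time_per_pattern):
    return max(n * t for n, t in zip(alloc, time_per_pattern))

def proportional_alloc(patterns, time_per_pattern):
    """Heuristic stand-in for the MIP: allocate patterns inversely
    proportional to per-pattern time, then place rounding leftovers
    where they increase the makespan least."""
    speeds = [1.0 / t for t in time_per_pattern]
    total = sum(speeds)
    alloc = [int(patterns * s / total) for s in speeds]
    while sum(alloc) < patterns:
        i = min(range(len(alloc)),
                key=lambda j: (alloc[j] + 1) * time_per_pattern[j])
        alloc[i] += 1
    return alloc

# e.g. two slower and two faster transputers (arbitrary relative times)
tpp = [1.0, 1.0, 0.8, 0.8]
alloc = proportional_alloc(100, tpp)
print(alloc, epoch_time(alloc, tpp))
```

An equal split of 25 patterns each would give an epoch time of 25.0 here; the skewed allocation finishes sooner, which is the effect the paper quantifies exactly via mixed integer programming.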

  16. Series-Parallel Superconducting Quantum Interference Device Arrays Using High-TC Ion Damage Junctions

    NASA Astrophysics Data System (ADS)

    Wong, Travis; Mukhanov, Oleg

    2015-03-01

    We have fabricated several designs of three-junction series-parallel DC Superconducting Quantum Interference Device (BiSQUID) arrays in YBa2Cu3O7-x using 10⁴ ion damage Josephson junctions on a single 1 cm² chip. A high aspect ratio ion implantation mask (30:1 ratio) with 30 nm slits was fabricated using electron beam lithography and low pressure reactive ion etching. Samples were irradiated with 60 keV helium ions to achieve a highly uniform damaged region throughout the thickness of the YBCO thin film, as confirmed by Monte Carlo ion implantation simulations. Low-frequency measurements of four different BiSQUID series-parallel SQUID array devices will be presented to investigate the effect of the BiSQUID design parameters on the linearity of the SQUID array response to magnetic fields. BiSQUID arrays could provide a promising architecture for transimpedance amplifiers with improved linearity.

  17. Development of Microreactor Array Chip-Based Measurement System for Massively Parallel Analysis of Enzymatic Activity

    NASA Astrophysics Data System (ADS)

    Hosoi, Yosuke; Akagi, Takanori; Ichiki, Takanori

    Microarray chip technology such as DNA chips, peptide chips and protein chips is one of the promising approaches for achieving high-throughput screening (HTS) of biomolecule function, since it has great advantages in the feasibility of automated information processing due to one-to-one indexing between array position and molecular function, as well as massively parallel sample analysis as a benefit of downsizing and large-scale integration. In most cases, however, the function that can be evaluated by such microarray chips is limited to the affinity of target molecules. In this paper, we propose a new HTS system for enzymatic activity based on microreactor array chip technology. A prototype of the automated and massively parallel measurement system for fluorometric assay of enzymatic reactions was developed by combining microreactor array chips with a highly sensitive fluorescence microscope. The design strategy of the microreactor array chips and an optical measurement platform for the high-throughput enzyme assay are discussed.

  18. Breast ultrasound tomography with two parallel transducer arrays: preliminary clinical results

    NASA Astrophysics Data System (ADS)

    Huang, Lianjie; Shin, Junseob; Chen, Ting; Lin, Youzuo; Intrator, Miranda; Hanson, Kenneth; Epstein, Katherine; Sandoval, Daniel; Williamson, Michael

    2015-03-01

    Ultrasound tomography has great potential to provide quantitative estimations of the physical properties of breast tumors for accurate characterization of breast cancer. We design and manufacture a new synthetic-aperture breast ultrasound tomography system with two parallel transducer arrays. The distance between these two transducer arrays is adjustable for scanning breasts of different sizes. The ultrasound transducer arrays are translated vertically to scan the entire breast slice by slice, acquiring ultrasound transmission and reflection data for whole-breast ultrasound imaging and tomographic reconstructions. We use the system to acquire patient data at the University of New Mexico Hospital for clinical studies. We present preliminary imaging results from in vivo patient ultrasound data. These preliminary clinical imaging results demonstrate the promise of our breast ultrasound tomography system with two parallel transducer arrays for breast cancer imaging and characterization.

  19. Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

    SciTech Connect

    Werner, N.E.; Van Matre, S.W.

    1985-05-01

    This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.

  20. Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures

    NASA Astrophysics Data System (ADS)

    Olson, Richard F.

    2013-05-01

    Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real-world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity, emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g., OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
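
The kernel at the heart of point-scatterer rendering is a coherent sum over scatterers. The sketch below (not AMRDEC's code) writes it as one fused, unit-stride array expression, the same shape of loop that AVX autovectorization handles well; the wavelength and scene values are invented for illustration.

```python
import numpy as np

wavelength = 0.003                           # ~3 mm carrier, assumed
k = 2 * np.pi / wavelength
rng = np.random.default_rng(1)
ranges = 1000.0 + rng.uniform(0, 10, 10000)  # scatterer ranges in meters
amps = 1.0 / ranges**2                       # toy amplitude fall-off model

# One branch-free pass over contiguous arrays: each scatterer contributes
# amplitude * exp(-j * phase), with phase set by the two-way path length.
signal = np.sum(amps * np.exp(-1j * 2 * k * ranges))
print(abs(signal))
```

In the paper's setting the same structure is expressed as a C loop over float arrays, either behind POSIX threads on vector library calls or annotated with OpenMP pragmas for the compiler to vectorize.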

  1. Multicoil resonance-based parallel array for smart wireless power delivery.

    PubMed

    Mirbozorgi, S A; Sawan, M; Gosselin, B

    2013-01-01

    This paper presents a novel resonance-based multicoil structure as a smart power surface to wirelessly power up apparatus such as mobile devices, animal headstages, implanted devices, etc. The proposed powering system is based on a 4-coil resonance-based inductive link, the resonance coil of which is formed by an array of several paralleled coils acting as a smart power transmitter. The power transmitter employs simple circuit connections and includes only one power driver circuit per multicoil resonance-based array, which enables higher power transfer efficiency and power delivery to the load. The power transmitted by the driver circuit is proportional to the load seen by each individual coil in the array. Thus, the transmitted power scales with the load of the electric/electronic system to be powered, and does not divide equally over every parallel coil that forms the array. Instead, only the loaded coils of the parallel array transmit a significant part of the total transmitted power to the receiver. Such adaptive behavior enables superior power, size, and cost efficiency compared to other solutions, since it does not require complex detection circuitry to find the location of the load. The performance of the proposed structure is verified by measurement results. Natural load detection and coverage of an area 4 times larger than conventional topologies, with a power transfer efficiency of 55%, are the novelties of the presented work.

  2. A class of parallel algorithms for computation of the manipulator inertia matrix

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Bejczy, Antal K.

    1989-01-01

    Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on the composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms, with topological variation, onto a two-dimensional processor array with nearest-neighbor connections, and, with cardinality variation, onto a linear processor array. An efficient parallel/pipeline algorithm achieving significantly higher efficiency was also developed for the linear array.

  3. 10-channel fiber array fabrication technique for parallel optical coherence tomography system

    NASA Astrophysics Data System (ADS)

    Arauz, Lina J.; Luo, Yuan; Castillo, Jose E.; Kostuk, Raymond K.; Barton, Jennifer

    2007-02-01

    Optical Coherence Tomography (OCT) shows great promise for minimally intrusive biomedical imaging applications. A parallel OCT system is a novel technique that replaces mechanical transverse scanning with electronic scanning, reducing the time required to acquire image data. In this system an array of small diameter fibers is required to obtain an image in the transverse direction. Each fiber in the array is configured in an interferometer and is used to image one pixel in the transverse direction. In this paper we describe a technique to package 15μm diameter fibers on a silicon-silica substrate to be used in a 2mm endoscopic probe tip. Single mode fibers are etched to reduce the cladding diameter from 125μm to 15μm. Etched fibers are placed into a 4mm by 150μm trench in a silicon-silica substrate and secured with UV glue. Active alignment was used to simplify the layout of the fibers and minimize unwanted horizontal displacement of the fibers. A 10-channel fiber array was built, tested, and later incorporated into a parallel OCT system. This paper describes the packaging, testing, and operation of the array in a parallel OCT system.

  4. dc properties of series-parallel arrays of Josephson junctions in an external magnetic field

    SciTech Connect

    Lewandowski, S.J.

    1991-04-01

    A detailed dc theory of superconducting multijunction interferometers has previously been developed by several authors for the case of parallel junction arrays. The theory is now extended to cover the case of a loop containing several junctions connected in series. The problem is closely associated with high-Tc superconductors and their clusters of intrinsic Josephson junctions. These materials exhibit spontaneous interferometric effects, and there is no reason to assume that the intrinsic junctions form only parallel arrays. A simple formalism of phase states is developed in order to express the superconducting phase differences across the junctions forming a series array as functions of the phase difference across the weakest junction of the system, and to relate the differences in critical currents of the junctions to gaps in the allowed ranges of their phase functions. This formalism is used to investigate the energy states of the array, which in the case of different junctions are split and separated by energy barriers of height depending on the phase gaps. Modifications of the washboard model of a single junction are shown. Next a superconducting inductive loop containing a series array of two junctions is considered, and this model is used to demonstrate the transitions between phase states and the associated instabilities. Finally, the critical current of a parallel connection of two series arrays is analyzed and shown to be a multivalued function of the externally applied magnetic flux. The instabilities caused by the presence of intrinsic serial junctions in granular high-Tc materials are pointed out as a potential source of additional noise.
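
For orientation, the simplest member of the family analyzed above is the textbook two-junction parallel loop: with negligible loop inductance and identical junctions, the critical current interferes with the applied flux. This is a toy limit, not the paper's series-array result, and the values are arbitrary.

```python
import numpy as np

def squid_ic(flux, ic0=1.0):
    """Critical current of an ideal symmetric two-junction dc SQUID
    vs. applied flux, in units of the flux quantum Phi0."""
    return 2 * ic0 * np.abs(np.cos(np.pi * flux))

flux = np.linspace(0, 2, 201)      # sweep two flux quanta
ic = squid_ic(flux)
print(ic.max(), ic.min())
```

The paper's point is that adding junctions in series inside the loop splits the phase states and makes the critical current of two paralleled series arrays a multivalued function of flux, unlike this single-valued ideal curve.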

  5. Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit

    SciTech Connect

    Daily, Jeffrey A.; Lewis, Robert R.

    2011-11-30

    Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
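
The drop-in claim means a serial NumPy kernel like the one below should need essentially only its import changed to run distributed under GAiN (the exact module path is version-dependent; something like `from ga4py import gain as np` is illustrative, not guaranteed). The kernel itself is plain NumPy and runs serially as written.

```python
import numpy as np   # per the abstract, the GAiN port swaps only this import

def laplacian_step(u):
    """One Jacobi relaxation sweep on a 2D grid, written entirely with
    whole-array slicing, the style GAiN can distribute across processes."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((64, 64))
u[0, :] = 1.0                    # hot top boundary
for _ in range(50):
    u = laplacian_step(u)
print(u.sum())
```

Because no loop indexes individual elements, the same source expresses either local array arithmetic or, under GAiN, collective operations on a distributed dense array.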

  6. Anisotropic charge and heat conduction through arrays of parallel elliptic cylinders in a continuous medium

    NASA Astrophysics Data System (ADS)

    Martin, James E.; Ribaudo, Troy

    2013-04-01

    Arrays of circular pores in silicon can exhibit a phononic bandgap when the lattice constant is smaller than the phonon scattering length, and so have become of interest for use as thermoelectric materials, due to the large reduction in thermal conductivity that this bandgap can cause. The reduction in electrical conductivity is expected to be less, because the lattice constant of these arrays is engineered to be much larger than the electron scattering length. As a result, electron transport through the effective medium is well described by the diffusion equation, and the Seebeck coefficient is expected to increase. In this paper, we develop an expression for the purely diffusive thermal (or electrical) conductivity of a composite comprised of square or hexagonal arrays of parallel circular or elliptic cylinders of one material in a continuum of a second material. The transport parallel to the cylinders is straightforward, so we consider the transport in the two principal directions normal to the cylinders, using a self-consistent local field calculation based on the point dipole approximation. There are two limiting cases: large negative contrast (e.g., pores in a conductor) and large positive contrast (conducting pillars in air). In the large negative contrast case, the transport is only slightly affected parallel to the major axis of the elliptic cylinders but can be significantly affected parallel to the minor axis, even in the limit of zero volume fraction of pores. The positive contrast case is just the opposite: the transport is only slightly affected parallel to the minor axis of the pillars but can be significantly affected parallel to the major axis, even in the limit of zero volume fraction of pillars. The analytical results are compared to extensive FEA calculations obtained using Comsol™ and the agreement is generally very good, provided the cylinders are sufficiently small compared to the lattice constant.
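
A dilute Maxwell-Garnett-type estimate in the spirit of the point-dipole treatment (not the authors' exact self-consistent formula) captures the anisotropy described above. For the field along semi-axis a of an elliptic cylinder with semi-axes a and b, the 2D depolarization factor is L = b/(a + b); all numbers below are illustrative.

```python
def eff_conductivity(sig_m, sig_i, phi, a, b):
    """Effective conductivity of a medium sig_m containing volume fraction
    phi of parallel elliptic cylinders sig_i, field along semi-axis a."""
    L = b / (a + b)                      # 2D depolarization factor
    num = phi * (sig_i - sig_m)
    den = sig_m + (1 - phi) * L * (sig_i - sig_m)
    return sig_m * (1 + num / den)

# Circular pores (a = b, sig_i = 0) recover the classic (1-phi)/(1+phi).
print(eff_conductivity(1.0, 0.0, 0.2, 1.0, 1.0))

# Elongated pores: transport along the major axis is degraded far less
# than transport along the minor axis, as stated in the abstract.
print(eff_conductivity(1.0, 0.0, 0.2, 3.0, 1.0),   # field along major axis
      eff_conductivity(1.0, 0.0, 0.2, 1.0, 3.0))   # field along minor axis
```

Swapping a and b swaps the two principal responses, and taking sig_i large instead of zero reproduces the opposite (pillar) limit the abstract describes.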

  7. Performance of the UCAN2 Gyrokinetic Particle In Cell (PIC) Code on Two Massively Parallel Mainframes with Intel ``Sandy Bridge'' Processors

    NASA Astrophysics Data System (ADS)

    Leboeuf, Jean-Noel; Decyk, Viktor; Newman, David; Sanchez, Raul

    2013-10-01

    The massively parallel, 2D domain-decomposed, nonlinear, 3D, toroidal, electrostatic, gyrokinetic, Particle in Cell (PIC), Cartesian geometry UCAN2 code, with particle ions and adiabatic electrons, has been ported to two emerging mainframes. These two computers, one at NERSC in the US built by Cray named Edison and the other at the Barcelona Supercomputer Center (BSC) in Spain built by IBM named MareNostrum III (MNIII) just happen to share the same Intel ``Sandy Bridge'' processors. The successful port of UCAN2 to MNIII which came online first has enabled us to be up and running efficiently in record time on Edison. Overall, the performance of UCAN2 on Edison is superior to that on MNIII, particularly at large numbers of processors (>1024) for the same Intel IFORT compiler. This appears to be due to different MPI modules (OpenMPI on MNIII and MPICH2 on Edison) and different interconnection networks (Infiniband on MNIII and Cray's Aries on Edison) on the two mainframes. Details of these ports and comparative benchmarks are presented. Work supported by OFES, USDOE, under contract no. DE-FG02-04ER54741 with the University of Alaska at Fairbanks.

  8. Parallel force measurement with a polymeric microbeam array using an optical microscope and micromanipulator.

    PubMed

    Sasoglu, F Mert; Bohl, Andrew J; Allen, Kathleen B; Layton, Bradley E

    2009-01-01

    An image analysis method and its validation are presented for tracking the displacements of parallel mechanical force sensors. Force is measured using a combination of beam theory, optical microscopy, and image analysis. The primary instrument is a calibrated polymeric microbeam array mounted on a micromanipulator with the intended purpose of measuring traction forces on cell cultures or cell arrays. One application is the testing of hypotheses involving cellular mechanotransduction mechanisms. An Otsu-based image analysis code calculates displacement and force on cellular or other soft structures by using edge detection and image subtraction on digitally captured optical microscopy images. Forces as small as 250 ± 50 nN and as great as 25 ± 2.5 μN may be applied and measured upon as few as one or as many as hundreds of structures in parallel. A validation of the method is provided by comparing results from a rigid glass surface and a compliant polymeric surface.
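
The beam-theory step behind this method is the standard cantilever relation: a cylindrical microbeam loaded at its tip gives F = 3EIδ/L³ with I = πd⁴/64. The modulus and dimensions below are invented, plausible values, not the calibrated properties of the authors' array.

```python
import math

def tip_force(E, d, L, deflection):
    """Tip force on a cylindrical cantilever from its measured deflection."""
    I = math.pi * d**4 / 64.0        # second moment of area, circular section
    return 3.0 * E * I * deflection / L**3

E = 2.5e6          # Pa, soft polymer modulus (assumed)
d = 10e-6          # beam diameter, m (assumed)
L = 100e-6         # beam length, m (assumed)
F = tip_force(E, d, L, deflection=1e-6)   # 1 µm tracked displacement
print(F)           # force in newtons, nanonewton scale for these values
```

In the instrument, the deflection input comes from the Otsu-based edge detection and image subtraction; everything else is this closed-form conversion.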

  9. SIMD massively parallel processing system for real-time image processing

    NASA Astrophysics Data System (ADS)

    Chen, Xiaochu; Zhang, Ming; Yao, Qingdong; Liu, Jilin; Ye, Hong; Wu, Song; Li, Dongxiao; Zhang, Yong; Ding, Lei; Yao, Zhongyang; Yang, Weijian; Pan, Qiaohai

    1998-09-01

    This paper describes the embedded SIMD massively parallel processor that we have developed for real-time image processing applications, such as real-time small-target detection and tracking and video processing. The processor array is based on the SIMD chip BAP-128, of our own design, and uses the high-performance DSP TMS320C31, which can effectively perform serial and floating-point calculations, as the host of the SIMD processor array. As a result, the system is able to perform a variety of image processing tasks in real time. Furthermore, the processor will be connected with a MIMD parallel processor to construct a heterogeneous parallel processor for more complex real-time ATR (Automatic Target Recognition) and computer vision applications.

  10. An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications.

    PubMed

    Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun

    2015-12-01

    Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition and action recognition because of its best-in-class recognition accuracy compared to hand-crafted algorithm and shallow learning based algorithms. Long learning time caused by its complex structure, however, limits its usage only in high-cost servers or many-core GPU platforms so far. On the other hand, the demand on customized pattern recognition within personal devices will grow gradually as more deep learning applications will be developed. This paper presents a SoC implementation to enable deep learning applications to run with low cost platforms such as mobile or portable devices. Different from conventional works which have adopted massively-parallel architecture, this work adopts task-flexible architecture and exploits multiple parallelism to cover complex functions of convolutional deep belief network which is one of popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable system. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art. PMID:26780817

  12. Excitation of a Parallel Plate Waveguide by an Array of Rectangular Waveguides

    NASA Technical Reports Server (NTRS)

    Rengarajan, Sembiam

    2011-01-01

    This work addresses the problem of excitation of a parallel plate waveguide by an array of rectangular waveguides that arises in applications such as the continuous transverse stub (CTS) antenna and dual-polarized parabolic cylindrical reflector antennas excited by a scanning line source. In order to design the junction region between the parallel plate waveguide and the linear array of rectangular waveguides, waveguide sizes have to be chosen so that the input match is adequate over the range of scan angles for both polarizations. The electromagnetic wave scattered by the junction of a parallel plate waveguide and an array of rectangular waveguides is analyzed by formulating coupled integral equations for the aperture electric field at the junction. The integral equations are solved by the method of moments (MoM). In order to make the computational process efficient and accurate, the method of weighted averaging was used to evaluate rapidly oscillating integrals encountered in the moment matrix. In addition, the real-axis spectral integral is evaluated on a deformed contour for speed and accuracy. The MoM results for a large finite array have been validated by comparing its reflection coefficients with corresponding results for an infinite array generated by the commercial finite element code HFSS. Once the aperture electric field is determined by MoM, the input reflection coefficients at each waveguide port, and the coupling for each polarization over the range of useful scan angles, are easily obtained. Results for the input impedance and coupling characteristics for both the vertical and horizontal polarizations are presented over a range of scan angles. It is shown that the scan range is limited to about 35° for both polarizations, and therefore the optimum waveguide is a square with a side of about 0.62 free-space wavelengths.

  13. Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation.

    PubMed

    Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan

    2015-08-01

    Optical nanopore sensing offers great potential in single-molecule detection, genotyping, or DNA sequencing for high-throughput applications. However, one of the bottle-necks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores, which has large pore-to-pore distance, small variation in pore size and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bps) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching resulting in pore diameters down to ∼7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively-parallel single-molecule detection. PMID:26180050

  14. Transmit and receive transmission line arrays for 7 Tesla parallel imaging.

    PubMed

    Adriany, Gregor; Van de Moortele, Pierre-Francois; Wiesinger, Florian; Moeller, Steen; Strupp, John P; Andersen, Peter; Snyder, Carl; Zhang, Xiaoliang; Chen, Wei; Pruessmann, Klaas P; Boesiger, Peter; Vaughan, Tommy; Uğurbil, Kāmil

    2005-02-01

    Transceive array coils, capable of RF transmission and independent signal reception, were developed for parallel, 1H imaging applications in the human head at 7 T (300 MHz). The coils combine the advantages of high-frequency properties of transmission lines with classic MR coil design. Because of the short wavelength at the 1H frequency at 300 MHz, these coils were straightforward to build and decouple. The sensitivity profiles of individual coils were highly asymmetric, as expected at this high frequency; however, the summed images from all coils were relatively uniform over the whole brain. Data were obtained with four- and eight-channel transceive arrays built using a loop configuration and compared to arrays built from straight stripline transmission lines. With both the four- and the eight-channel arrays, parallel imaging with sensitivity encoding with high reduction numbers was feasible at 7 T in the human head. A one-dimensional reduction factor of 4 was robustly achieved with an average g value of 1.25 with the eight-channel transmit/receive coils.
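
The sensitivity-encoding step these arrays enable can be sketched numerically: each aliased pixel at reduction factor R mixes R true pixels, which are separated by least squares using the coil sensitivity profiles, and the g-factor follows from (SᴴS)⁻¹. The sensitivities below are random complex stand-ins, not measured 7 T profiles.

```python
import numpy as np

rng = np.random.default_rng(2)
n_coils, R = 8, 4                      # eight-channel array, reduction 4
S = (rng.standard_normal((n_coils, R))
     + 1j * rng.standard_normal((n_coils, R)))   # coil sensitivities

truth = np.array([1.0, 0.5, -0.3, 2.0])          # R superimposed pixels
aliased = S @ truth                              # what each coil measures

# SENSE unfolding: least-squares separation of the R pixel values.
unfolded, *_ = np.linalg.lstsq(S, aliased, rcond=None)

# Noise amplification per pixel (g-factor), >= 1 by construction.
SHS = S.conj().T @ S
g = np.sqrt(np.real(np.diag(np.linalg.inv(SHS))) * np.real(np.diag(SHS)))
print(np.round(np.real(unfolded), 6), g)
```

The quality of the coil geometry shows up entirely in g: the abstract's average g of 1.25 at one-dimensional R = 4 indicates well-conditioned sensitivity matrices across the head.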

  15. Range and egomotion estimation from compound photodetector arrays with parallel optical axis using optical flow techniques.

    PubMed

    Chahl, J S

    2014-01-20

    This paper describes an application for arrays of narrow-field-of-view sensors with parallel optical axes. These devices exhibit some complementary characteristics with respect to conventional perspective projection or angular projection imaging devices. Conventional imaging devices measure rotational egomotion directly by measuring the angular velocity of the projected image. Translational egomotion cannot be measured directly by these devices because the induced image motion depends on the unknown range of the viewed object. On the other hand, a known translational motion generates image velocities which can be used to recover the ranges of objects and hence the three-dimensional (3D) structure of the environment. A new method is presented for computing egomotion and range using the properties of linear arrays of independent narrow-field-of-view optical sensors. An approximate parallel projection can be used to measure translational egomotion in terms of the velocity of the image. On the other hand, a known rotational motion of the paraxial sensor array generates image velocities, which can be used to recover the 3D structure of the environment. Results of tests of an experimental array confirm these properties.

  16. Nanopore arrays in a silicon membrane for parallel single-molecule detection: DNA translocation

    NASA Astrophysics Data System (ADS)

    Zhang, Miao; Schmidt, Torsten; Jemt, Anders; Sahlén, Pelin; Sychugov, Ilya; Lundeberg, Joakim; Linnros, Jan

    2015-08-01

    Optical nanopore sensing offers great potential in single-molecule detection, genotyping, or DNA sequencing for high-throughput applications. However, one of the bottle-necks for fluorophore-based biomolecule sensing is the lack of an optically optimized membrane with a large array of nanopores, which has large pore-to-pore distance, small variation in pore size and low background photoluminescence (PL). Here, we demonstrate parallel detection of single-fluorophore-labeled DNA strands (450 bps) translocating through an array of silicon nanopores that fulfills the above-mentioned requirements for optical sensing. The nanopore array was fabricated using electron beam lithography and anisotropic etching followed by electrochemical etching resulting in pore diameters down to ∼7 nm. The DNA translocation measurements were performed in a conventional wide-field microscope tailored for effective background PL control. The individual nanopore diameter was found to have a substantial effect on the translocation velocity, where smaller openings slow the translocation enough for the event to be clearly detectable in the fluorescence. Our results demonstrate that a uniform silicon nanopore array combined with wide-field optical detection is a promising alternative with which to realize massively-parallel single-molecule detection.

  17. Comparing a new laser strainmeter array with an adjacent, parallel running quartz tube strainmeter array.

    PubMed

    Kobe, Martin; Jahr, Thomas; Pöschel, Wolfgang; Kukowski, Nina

    2016-03-01

    In summer 2011, two new laser strainmeters, each about 26.6 m long, were installed in N-S and E-W directions parallel to an existing quartz tube strainmeter system at the Geodynamic Observatory Moxa, Thuringia/Germany. This kind of installation is unique in the world and allows, for the first time, the direct comparison of measurements of horizontal length changes with different types of strainmeters. For the comparison of both data sets, we used tidal analysis over three years, the strain signals resulting from drilling a shallow 100 m deep borehole on the grounds of the observatory, and long-period signals. The tidal strain amplitude factors of the laser strainmeters are found to be much closer to theoretical values (85%-105% N-S and 56%-92% E-W) than those of the quartz tube strainmeters. A first data analysis shows that the new laser strainmeters are more sensitive in the short-period range, with an improved signal-to-noise ratio, and distinctly more stable under long-term drifts of environmental parameters such as air pressure or groundwater level. We compared the signal amplitudes of both strainmeter systems at variable signal periods and found frequency-dependent amplitude differences. Confirmed by the tidal parameters, we now have a stable, high-resolution laser strainmeter system that serves as a calibration reference for quartz tube strainmeters. PMID:27036794

  18. A Full Parallel Event Driven Readout Technique for Area Array SPAD FLIM Image Sensors

    PubMed Central

    Nie, Kaiming; Wang, Xinlei; Qiao, Jun; Xu, Jiangtao

    2016-01-01

    This paper presents a full parallel event driven readout method implemented in an area array single-photon avalanche diode (SPAD) image sensor for high-speed fluorescence lifetime imaging microscopy (FLIM). By adopting this readout method, the sensor records and reads out only effective time and position information, reducing the amount of data. The image sensor includes four 8 × 8 pixel arrays. In each array, four time-to-digital converters (TDCs) quantize the arrival times of photons, and two address record modules record the column and row information. In this work, Monte Carlo simulations were performed in Matlab to assess the pile-up effect induced by the readout method. The sensor's resolution is 16 × 16. The time resolution of the TDCs is 97.6 ps and the quantization range is 100 ns. The readout frame rate is 10 Mfps, and the maximum imaging frame rate is 100 fps. The chip's output bandwidth is 720 MHz with an average power of 15 mW. The lifetime resolvability range is 5-20 ns, and the average error of estimated fluorescence lifetimes is below 1% when employing CMM to estimate lifetimes. PMID:26828490

  19. Parallel- and series-fed microstrip array with high efficiency and low cross polarization

    NASA Technical Reports Server (NTRS)

    Huang, John (Inventor)

    1995-01-01

    A microstrip array antenna for a vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications, with a physical area of 1.7 m by 0.17 m, comprises two rows of patch elements and employs a parallel feed to the left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel, with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, are terminated with half-wavelength-long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

  20. High-performance SPAD array detectors for parallel photon timing applications

    NASA Astrophysics Data System (ADS)

    Rech, I.; Cuccato, A.; Antonioli, S.; Cammi, C.; Gulinatti, A.; Ghioni, M.

    2012-02-01

    Over the past few years there has been a growing interest in monolithic arrays of single photon avalanche diodes (SPADs) for spatially resolved detection of faint ultrafast optical signals. SPADs implemented in planar technologies offer the typical advantages of microelectronic devices (small size, ruggedness, low voltage, low power, etc.). Furthermore, they have inherently higher photon detection efficiency than PMTs and are able to provide, besides sensitivity down to single photons, very high acquisition speeds. In order to make SPAD arrays more and more competitive in time-resolved applications, it is necessary to face problems like electrical crosstalk between adjacent pixels; moreover, all the single-photon timing electronics with picosecond resolution has to be developed. In this paper we present a new instrument suitable for single-photon imaging applications, made up of 32 time-resolved parallel channels. The 32x1 pixel array that includes the SPAD detectors represents the system core, and an embedded data elaboration unit performs on-board data processing for single-photon counting applications. Photon-timing information is exported through a custom parallel cable that can be connected to an external multichannel TCSPC system.

  1. Dynamic scheduling and planning parallel observations on large Radio Telescope Arrays with the Square Kilometre Array in mind

    NASA Astrophysics Data System (ADS)

    Buchner, Johannes

    2011-12-01

    Scheduling, the task of producing a time table for resources and tasks, is well known to become a difficult problem as more resources are involved (an NP-hard problem). This is about to become an issue in radio astronomy as observatories consisting of hundreds to thousands of telescopes are planned and operated. The Square Kilometre Array (SKA), which Australia and New Zealand bid to host, is aiming for scales where current approaches -- in construction, operation but also scheduling -- are insufficient. Although manual scheduling is common today, the problem is becoming complicated by the demand for (1) independent sub-arrays doing simultaneous observations, which requires the scheduler to plan parallel observations, and (2) dynamic re-scheduling under changed conditions. Both of these requirements apply to the SKA, especially in the construction phase. We review the scheduling approaches taken in the astronomy literature, as well as investigate techniques from human schedulers and today's observatories. The scheduling problem is specified in general for scientific observations and in particular for radio telescope arrays. Also taken into account is the fact that the observatory may be oversubscribed, requiring the scheduling problem to be integrated with a planning process. We solve this long-term scheduling problem using a time-based encoding that works in the very general case of observation scheduling. This research then compares algorithms from various approaches, including fast heuristics from CPU scheduling, linear integer programming, genetic algorithms, and branch-and-bound enumeration schemes. Measures include not only goodness of the solution, but also scalability and re-scheduling capabilities. In conclusion, we have identified a fast and good scheduling approach that allows (re-)scheduling of difficult and changing problems by combining heuristics with a genetic algorithm using block-wise mutation operations. We are able to explain and eradicate two problems in the
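
    The winning combination the abstract converges on - a genetic algorithm whose mutation operator relocates contiguous blocks of a time-based encoding - can be sketched as follows. This is a minimal illustration, not code from the thesis; the request list, horizon, and greedy fitness rule are invented for the example.

```python
import random

# Hypothetical observation requests: (name, duration in slots, priority).
REQUESTS = [("A", 2, 3), ("B", 1, 5), ("C", 3, 2), ("D", 2, 4), ("E", 1, 1)]
HORIZON = 6  # number of available time slots

def fitness(order):
    """Total priority of the requests that fit, taken greedily in order."""
    used = score = 0
    for name, dur, prio in order:
        if used + dur <= HORIZON:
            used += dur
            score += prio
    return score

def block_mutate(order):
    """Block-wise mutation: cut out a contiguous block and reinsert it."""
    order = order[:]
    i, j = sorted(random.sample(range(len(order)), 2))
    block = order[i:j + 1]
    del order[i:j + 1]
    k = random.randrange(len(order) + 1)
    return order[:k] + block + order[k:]

def evolve(pop_size=20, generations=100):
    """Keep the best half of the population, refill with mutated copies."""
    pop = [random.sample(REQUESTS, len(REQUESTS)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        pop = survivors + [block_mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

    Re-scheduling under changed conditions then amounts to re-running `evolve` seeded with the previous best order instead of random permutations.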

  2. Design and implementation of a parallel array operator for the arbitrary remapping of data.

    SciTech Connect

    Dietz, Steven; Choi, S. E.; Chamberlain, B. L.; Snyder, Lawrence

    2003-01-01

    The data redistribution or remapping functions, gather and scatter, are of long standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations - run-length encoding, dead array reuse, and direct communication - that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate performance comparable to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.
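
    The gather/scatter semantics of such a remapping operator, and the run-length-encoding optimization the abstract mentions, can be sketched in Python; ZPL syntax is not reproduced, and the function names are illustrative.

```python
def gather(src, index_map):
    """Gather: result[i] = src[index_map[i]] (arbitrary remap via reads)."""
    return [src[j] for j in index_map]

def scatter(src, index_map, size, default=0):
    """Scatter: result[index_map[i]] = src[i] (arbitrary remap via writes)."""
    out = [default] * size
    for i, j in enumerate(index_map):
        out[j] = src[i]
    return out

def rle(index_map):
    """Run-length-encode an index map: consecutive indices compress to
    (start, length) runs, shrinking the communication metadata."""
    runs = []
    for j in index_map:
        if runs and runs[-1][0] + runs[-1][1] == j:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((j, 1))
    return runs

a = [10, 20, 30, 40]
print(gather(a, [3, 2, 1, 0]))      # reversal as a gather -> [40, 30, 20, 10]
print(scatter(a, [1, 0, 3, 2], 4))  # pairwise swap as a scatter -> [20, 10, 40, 30]
print(rle([0, 1, 2, 7, 8]))         # -> [(0, 3), (7, 2)]
```

    In a distributed setting each processor would apply `rle` to its local slice of the index map and exchange only the runs, which is the cost-saving the paper's optimization targets.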

  3. Multi-focus parallel detection of fluorescent molecules at picomolar concentration with photonic nanojets arrays

    SciTech Connect

    Ghenuche, Petru; Torres, Juan de; Ferrand, Patrick; Wenger, Jérôme

    2014-09-29

    Fluorescence sensing and fluorescence correlation spectroscopy (FCS) are powerful methods to detect and characterize single molecules; yet, their use has been restricted by expensive and complex optical apparatus. Here, we present a simple integrated design using a self-assembled two-dimensional array of microspheres to realize a multi-focus parallel detection scheme for FCS. We simultaneously illuminate and collect the fluorescence from several tens of microspheres, which all generate their own photonic nanojet to efficiently excite the molecules and collect the fluorescence emission. Each photonic nanojet contributes to the global detection volume, reaching FCS detection volumes of several tens of femtoliters while preserving the fluorescence excitation and collection efficiencies. The microsphere photonic nanojet array enables FCS experiments at low picomolar concentrations with a drastic reduction in apparatus cost and alignment constraints, ideal for microfluidic chip integration.

  4. Nanopore arrays in a silicon membrane for parallel single-molecule detection: fabrication

    NASA Astrophysics Data System (ADS)

    Schmidt, Torsten; Zhang, Miao; Sychugov, Ilya; Roxhed, Niclas; Linnros, Jan

    2015-08-01

    Solid state nanopores enable translocation and detection of single bio-molecules such as DNA in buffer solutions. Here, sub-10 nm nanopore arrays in silicon membranes were fabricated by using electron-beam lithography to define etch pits and by using a subsequent electrochemical etching step. This approach effectively decouples positioning of the pores and the control of their size, where the pore size essentially results from the anodizing current and time in the etching cell. Nanopores with diameters as small as 7 nm, fully penetrating 300 nm thick membranes, were obtained. The presented fabrication scheme to form large arrays of nanopores is attractive for parallel bio-molecule sensing and DNA sequencing using optical techniques. In particular, the signal-to-noise ratio is improved compared with alternatives such as nitride membranes, which suffer from a high luminescence background.

  5. Parallel array and mixture-based synthetic combinatorial chemistry: tools for the next millennium.

    PubMed

    Houghten, R A

    2000-01-01

    Technological advances continue to be a central driving force in the acceleration of the drug discovery process. Combinatorial chemistry methods, developed over the past 15 years, represent a paradigm shift in drug discovery. Initially viewed as a curiosity by the pharmaceutical industry, combinatorial chemistry is now recognized as an essential tool that decreases the time of discovery and increases the throughput of chemical screening by as much as 1000-fold. The use of parallel array synthesis approaches and mixture-based combinatorial libraries for drug discovery is reviewed.

  6. Parallelization and improvements of the generalized born model with a simple sWitching function for modern graphics processors.

    PubMed

    Arthur, Evan J; Brooks, Charles L

    2016-04-15

    Two fundamental challenges of simulating biologically relevant systems are the rapid calculation of the energy of solvation and the trajectory length of a given simulation. The Generalized Born model with a Simple sWitching function (GBSW) addresses these issues by using an efficient approximation of Poisson-Boltzmann (PB) theory to calculate each solute atom's free energy of solvation, the gradient of this potential, and the subsequent forces of solvation without the need for explicit solvent molecules. This study presents a parallel refactoring of the original GBSW algorithm and its implementation on newly available, low cost graphics chips with thousands of processing cores. Depending on the system size and nonbonded force cutoffs, the new GBSW algorithm offers speed increases of between one and two orders of magnitude over previous implementations while maintaining similar levels of accuracy. We find that much of the algorithm scales linearly with an increase of system size, which makes this water model cost effective for solvating large systems. Additionally, we utilize our GPU-accelerated GBSW model to fold the model system chignolin, and in doing so we demonstrate that these speed enhancements now make accessible folding studies of peptides and potentially small proteins.

  7. Real-time processor for staring receivers

    NASA Technical Reports Server (NTRS)

    Hanzal, Brian; Peczalski, Andrzej; Schwanebeck, James; Sanderson, Richard; Fossum, Eric

    1992-01-01

    The design, fabrication, and testing of a state-of-the-art, high-throughput on-focal-plane IR-image signal processor are described. The processing functions performed are frame differencing and thresholding. The final focal plane array will consist of a 128 x 128-pixel platinum-silicide detector bump-mounted to an on-chip CCD multiplexer. The processor is in a 128-channel parallel-pipeline format. Each channel consists of a pixel regenerator (charge differencer), 128-pixel frame store CCD memory, pixel differencer, second pixel regenerator, thresholder (analog comparator), and digital latch. Four parallel analog outputs and four parallel digital outputs are included. The digital outputs provide a bit map of the image. All analog clock signals (128 kHz, 256 kHz, and 5 MHz) are generated by on-chip TTL-input clock drivers. TTL clock driver inputs are generated off-chip. The technology is low-temperature surface and buried channel CCD/CMOS/indium bump. The design goal was 8-bit resolution at 77 K and 1000 frames/s. Applications include point- or extended-target motion detection with thresholding. Design trade-offs and enhancements (such as on-chip detector gain compensation and a simple window processor) are discussed.

  8. Dedicated optoelectronic stochastic parallel processor for real-time image processing: motion-detection demonstration and design of a hybrid complementary-metal-oxide-semiconductor/self-electro-optic-device-based prototype.

    PubMed

    Cassinelli, A; Chavel, P; Desmulliez, M P

    2001-12-10

    We report experimental results and a performance analysis of a dedicated optoelectronic processor that implements stochastic optimization-based image-processing tasks in real time. We first show experimental results using a proof-of-principle prototype demonstrator based on standard silicon complementary-metal-oxide-semiconductor (CMOS) technology and liquid-crystal spatial light modulators. We then elaborate on the advantages of using a hybrid CMOS/self-electro-optic-device-based smart-pixel array to monolithically integrate photodetectors and modulators on the same chip, providing compact, high-bandwidth intrachip optoelectronic interconnects. We have modeled the operation of the monolithic processor, clearly showing system-performance improvement.

  9. Two-Dimensional Systolic Array For Kalman-Filter Computing

    NASA Technical Reports Server (NTRS)

    Chang, Jaw John; Yeh, Hen-Geul

    1988-01-01

    Two-dimensional, systolic-array, parallel data processor performs Kalman filtering in real time. Algorithm rearranged to be Faddeev algorithm for generalized signal processing. Algorithm mapped onto very-large-scale integrated-circuit (VLSI) chip in two-dimensional, regular, simple, expandable array of concurrent processing cells. Processor does matrix/vector-based algebraic computations. Applications include adaptive control of robots, remote manipulators and flexible structures and processing radar signals to track targets.
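
    The Faddeev algorithm referenced above computes D + CA⁻¹B by annihilating the lower-left block of the compound matrix [[A, B], [-C, D]] with row operations - exactly the kind of regular, local updates that map onto a systolic array of cells. A sequential pure-Python sketch (no pivoting, for illustration only):

```python
def faddeev(A, B, C, D):
    """Return D + C @ inv(A) @ B by Gaussian elimination on the
    compound matrix [[A, B], [-C, D]] (no pivoting, illustration only)."""
    n, m, p = len(A), len(D), len(B[0])
    # Working array: the first n rows are [A | B], the last m are [-C | D].
    W = [A[i][:] + B[i][:] for i in range(n)]
    W += [[-c for c in C[i]] + D[i][:] for i in range(m)]
    for k in range(n):                       # pivot only on the A rows
        for i in range(k + 1, n + m):
            f = W[i][k] / W[k][k]
            for j in range(k, n + p):
                W[i][j] -= f * W[k][j]
    return [row[n:] for row in W[n:]]        # lower-right block

# C A^-1 B + D with A = diag(2, 4): 1*(1/2)*1 + 1*(1/4)*2 + 0 = 1.0
print(faddeev([[2, 0], [0, 4]], [[1], [2]], [[1, 1]], [[0]]))  # -> [[1.0]]
```

    With A the identity and D zero the same elimination yields the product CB, and with C the identity it yields D + A⁻¹B, which is why one array structure covers the multiply-add and linear-solve steps of a Kalman filter update.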

  10. Fast Confocal Raman Imaging Using a 2-D Multifocal Array for Parallel Hyperspectral Detection.

    PubMed

    Kong, Lingbo; Navas-Moreno, Maria; Chan, James W

    2016-01-19

    We present the development of a novel confocal hyperspectral Raman microscope capable of imaging at speeds up to 100 times faster than conventional point-scan Raman microscopy under high noise conditions. The microscope utilizes scanning galvomirrors to generate a two-dimensional (2-D) multifocal array at the sample plane, generating Raman signals simultaneously at each focus of the array pattern. The signals are combined into a single beam and delivered through a confocal pinhole before being focused through the slit of a spectrometer. To separate the signals from each row of the array, a synchronized scan mirror placed in front of the spectrometer slit positions the Raman signals onto different pixel rows of the detector. We devised an approach to deconvolve the superimposed signals and retrieve the individual spectra at each focal position within a given row. The galvomirrors were programmed to scan different focal arrays following Hadamard encoding patterns. A key feature of the Hadamard detection is the reconstruction of individual spectra with improved signal-to-noise ratio. Using polystyrene beads as test samples, we demonstrated not only that our system images faster than a conventional point-scan method but that it is especially advantageous under noisy conditions, such as when the CCD detector operates at fast read-out rates and high temperatures. This is the first demonstration of multifocal confocal Raman imaging in which parallel spectral detection is implemented along both axes of the CCD detector chip. We envision this novel 2-D multifocal spectral detection technique can be used to develop faster imaging spontaneous Raman microscopes with lower cost detectors. PMID:26654100
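
    The Hadamard encode/decode step can be illustrated with the smallest cyclic S-matrix. The patterns below are the standard order-3 S-matrix construction, not the paper's actual galvo patterns; the inverse has the closed form (2Sᵀ - J)/2, with J the all-ones matrix.

```python
# Cyclic S-matrix of order 3 (each row is one multiplexed illumination pattern).
S = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]

def encode(signals):
    """Each measurement is the sum of the signals selected by one pattern."""
    return [sum(s * x for s, x in zip(row, signals)) for row in S]

def decode(meas):
    """Recover the individual signals with the S-matrix inverse (2*S^T - J)/2."""
    n = len(S)
    inv = [[(2 * S[j][i] - 1) / 2 for j in range(n)] for i in range(n)]
    return [sum(a * m for a, m in zip(row, meas)) for row in inv]

spectra = [4.0, 1.0, 7.0]
print(decode(encode(spectra)))  # -> [4.0, 1.0, 7.0]
```

    Because every signal contributes to roughly half of all measurements, detector noise is averaged down in the reconstruction, which is the multiplexing advantage the paper exploits at fast CCD read-out rates.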

  11. Large-scale parallel arrays of silicon nanowires via block copolymer directed self-assembly.

    PubMed

    Farrell, Richard A; Kinahan, Niall T; Hansel, Stefan; Stuen, Karl O; Petkov, Nikolay; Shaw, Matthew T; West, Laetitia E; Djara, Vladimir; Dunne, Robert J; Varona, Olga G; Gleeson, Peter G; Jung, Soon-Jung; Kim, Hye-Young; Koleśnik, Maria M; Lutz, Tarek; Murray, Christopher P; Holmes, Justin D; Nealey, Paul F; Duesberg, Georg S; Krstić, Vojislav; Morris, Michael A

    2012-05-21

    Extending the resolution and spatial proximity of lithographic patterning below critical dimensions of 20 nm remains a key challenge with very-large-scale integration, especially if the persistent scaling of silicon electronic devices is sustained. One approach, which relies upon the directed self-assembly of block copolymers by chemical epitaxy, is capable of achieving high density 1 : 1 patterning with critical dimensions approaching 5 nm. Herein, we outline an integration-favourable strategy for fabricating high areal density arrays of aligned silicon nanowires by directed self-assembly of PS-b-PMMA block copolymer nanopatterns with an L0 (pitch) of 42 nm, on chemically pre-patterned surfaces. Parallel arrays (5 × 10⁶ wires per cm) of uni-directional and isolated silicon nanowires on insulator substrates with critical dimensions ranging from 15 to 19 nm were fabricated by using precision plasma etch processes, with each stage monitored by electron microscopy. This step-by-step approach provides detailed information on interfacial oxide formation at the device silicon layer, the polystyrene profile during plasma etching, and the final critical dimension uniformity and line edge roughness variation of the nanowires during processing. The resulting silicon-nanowire array devices exhibit Schottky-type behaviour and a clear field-effect. The measured values for resistivity and specific contact resistance were ((2.6 ± 1.2) × 10⁵ Ω cm) and ((240 ± 80) Ω cm²) respectively. These values are typical for intrinsic (un-doped) silicon when contacted by a high work function metal, albeit counterintuitive, as the resistivity of the starting wafer (∼10 Ω cm) is 4 orders of magnitude lower. In essence, the nanowires are so small and consist of so few atoms that, statistically, at the original doping level each nanowire contains less than a single dopant atom and consequently exhibits the electrical behaviour of the un-doped host material. Moreover this indicates that the processing

  12. Computation and parallel implementation for early vision

    NASA Technical Reports Server (NTRS)

    Gualtieri, J. Anthony

    1990-01-01

    The problem of early vision is to transform one or more retinal illuminance images - pixel arrays - into image representations built out of primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.
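
    The correlation matching of item (2) - the per-pixel task that a SIMD machine such as the Massively Parallel Processor evaluates at every pixel simultaneously - can be sketched as a 1-D disparity search using normalized cross-correlation. The data and names are illustrative, not from the report.

```python
def ncc(a, b):
    """Normalized cross-correlation of two equal-length, non-constant patches."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db)

def best_disparity(left, right, x, w, max_d):
    """Shift of the right-image patch best matching the left patch at x
    (assumes x - max_d >= 0; a SIMD array runs this for all x at once)."""
    patch = left[x:x + w]
    scores = [(ncc(patch, right[x - d:x - d + w]), d) for d in range(max_d + 1)]
    return max(scores)[1]

left  = [0, 0, 9, 8, 7, 0, 0, 0]
right = [9, 8, 7, 0, 0, 0, 0, 0]
print(best_disparity(left, right, 2, 3, 2))  # -> 2 (feature shifted 2 pixels)
```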

  13. The Milstar Advanced Processor

    NASA Astrophysics Data System (ADS)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  14. Parallel Aligned Mesopore Arrays in Pyramidal-Shaped Gallium Nitride and Their Photocatalytic Applications.

    PubMed

    Kim, Hee Jun; Park, Joonmo; Ye, Byeong Uk; Yoo, Chul Jong; Lee, Jong-Lam; Ryu, Sang-Wan; Lee, Heon; Choi, Kyoung Jin; Baik, Jeong Min

    2016-07-20

    Parallel aligned mesopore arrays in pyramidal-shaped GaN are fabricated by using an electrochemical anodic etching technique, followed by inductively coupled plasma etching assisted by SiO2 nanosphere lithography, and used as a promising photoelectrode for solar water oxidation. The parallel alignment of the pores of several tens of micrometers scale in length is achieved by the low applied voltage and prepattern guided anodization. The dry etching of single-layer SiO2 nanosphere-coated GaN produces a pyramidal shape of the GaN, making the pores open at both sides and shortening the escape path of evolved gas bubbles produced inside pores during the water oxidation. The absorption spectra show that the light absorption in the UV range is ∼93% and that there is a red shift in the absorption edge by 30 nm, compared with the flat GaN. It also shows a remarkable enhancement in the photocurrent density by 5.3 times, compared with flat GaN. Further enhancement (∼40%) by the deposition of Ni was observed due to the generation of an electric field, which increases the charge separation ratio. PMID:27347685

  16. Peripheral processors for high-speed simulation. [helicopter cockpit simulator

    NASA Technical Reports Server (NTRS)

    Karplus, W. J.

    1977-01-01

    This paper describes some of the results of a study directed to the specification and procurement of a new cockpit simulator for an advanced class of helicopters. A part of the study was the definition of a challenging benchmark problem, and detailed analyses of it were made to assess the suitability of a variety of simulation techniques. The analyses showed that a particularly cost-effective approach to the attainment of adequate speed for this extremely demanding application is to employ a large minicomputer acting as host and controller for a special-purpose digital peripheral processor. Various realizations of such peripheral processors, all employing state-of-the-art electronic circuitry and a high degree of parallelism and pipelining, are available or under development. The types of peripheral processors - array processors, simulation-oriented processors, and arrays of processing elements - are analyzed and compared. These are particularly promising approaches which should be suitable for high-speed simulations of all kinds, the cockpit simulator being a case in point.

  17. Multifunctional optical processor based on symbolic substitution

    SciTech Connect

    Casasent, D.P.; Botha, E.C.

    1989-04-01

    The authors propose an optical multifunctional processor that can perform logic, numeric, pattern recognition, morphological, and inference operations. The ability to perform such diverse functions on one optical processor architecture is unique. The processor uses the technique of symbolic substitution and is based on an optical correlator architecture. Several inputs can be operated on in parallel, and different functions can be performed at one time, making it a multiple-instruction multiple-data processor.
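
    As a minimal stand-in for symbolic substitution, binary addition can be written as repeated parallel rewriting of local bit patterns: every bit pair is substituted by its XOR (the sum bit) and its AND shifted left (the carry) until no carries remain. This shows the principle only, not the optical correlator's actual rule set.

```python
def substitute_add(a, b):
    """Binary addition by repeated pattern substitution: all bit positions
    are rewritten in parallel as (XOR, AND << 1) until the carry word is 0."""
    while b:
        a, b = a ^ b, (a & b) << 1
    return a

print(substitute_add(0b1011, 0b0110))  # -> 17 (0b10001)
```

    Each loop iteration corresponds to one optical substitution pass over the whole data plane, so the number of passes is bounded by the carry-propagation length rather than the number of operands.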

  18. New computing environments: Parallel, vector and systolic

    SciTech Connect

    Wouk, A.

    1986-01-01

    This book presents papers on supercomputers and array processors. Topics considered include nested dissection, the systolic level 2 BLAS, parallel processing a hydrodynamic shock wave problem, MACH-1, portable standard LISP on the Cray, distributed combinator evaluation, performance and library issues, scale problems, multiprocessor architecture, the MIDAS multiprocessor system, parallel algorithms for incompressible and compressible flows on a multiprocessor, and parallel algorithms for elliptic equations.

  19. Synchronizing large systolic arrays

    SciTech Connect

    Fisher, A.L.; Kung, H.T.

    1982-04-01

    Parallel computing structures consist of many processors operating simultaneously. If a concurrent structure is regular, as in the case of a systolic array, it may be convenient to think of all processors as operating in lock step. Totally synchronized systems controlled by central clocks are difficult to implement because of the inevitable problem of clock skews and delays. An alternative means of enforcing the necessary synchronization is the use of self-timed, asynchronous schemes, at the price of increased design complexity and hardware cost. Realizing that different circumstances call for different synchronization methods, this paper provides a spectrum of synchronization models; based on the assumptions made for each model, theoretical lower bounds on clock skew are derived, and appropriate or best-possible synchronization schemes for systolic arrays are proposed. This paper represents a first step towards a systematic study of synchronization problems for large systolic arrays.

  20. Highly parallel introduction of nucleic acids into mammalian cells grown in microwell arrays

    PubMed Central

    Jain, Tilak; McBride, Ryan; Head, Steven; Saez, Enrique

    2010-01-01

    High-throughput cell-based screens of genome-size collections of cDNAs and siRNAs have become a powerful tool to annotate the mammalian genome, enabling the discovery of novel genes associated with normal cellular processes and pathogenic states, and the unraveling of genetic networks and signaling pathways in a systems biology approach. However, the capital expenses and the cost of reagents necessary to perform such large screens have limited application of this technology. Efforts to miniaturize the screening process have centered on the development of cellular microarrays created on microscope slides that use chemical means to introduce exogenous genetic material into mammalian cells. While this work has demonstrated the feasibility of screening in very small formats, the use of chemical transfection reagents (effective only in a subset of cell lines and not on primary cells) and the lack of defined borders between cells grown in adjacent microspots containing different genetic material (to prevent cell migration and to aid spot location recognition during imaging and phenotype deconvolution) have hampered the spread of this screening technology. Here, we describe proof-of-principles experiments to circumvent these drawbacks. We have created microwell arrays on an electroporation-ready transparent substrate and established procedures to achieve highly efficient parallel introduction of exogenous molecules into human cell lines and primary mouse macrophages. The microwells confine cells and offer multiple advantages during imaging and phenotype analysis. We have also developed a simple method to load this 484-microwell array with libraries of nucleic acids using a standard microarrayer. These advances can be elaborated upon to form the basis of a miniaturized high-throughput functional genomics screening platform to carry out genome-size screens in a variety of mammalian cells that may eventually become a mainstream tool for life science research. PMID:20024036

  1. Highly parallel introduction of nucleic acids into mammalian cells grown in microwell arrays.

    PubMed

    Jain, Tilak; McBride, Ryan; Head, Steven; Saez, Enrique

    2009-12-21

    High-throughput cell-based screens of genome-size collections of cDNAs and siRNAs have become a powerful tool to annotate the mammalian genome, enabling the discovery of novel genes associated with normal cellular processes and pathogenic states, and the unravelling of genetic networks and signaling pathways in a systems biology approach. However, the capital expenses and the cost of reagents necessary to perform such large screens have limited application of this technology. Efforts to miniaturize the screening process have centered on the development of cellular microarrays created on microscope slides that use chemical means to introduce exogenous genetic material into mammalian cells. While this work has demonstrated the feasibility of screening in very small formats, the use of chemical transfection reagents (effective only in a subset of cell lines and not on primary cells) and the lack of defined borders between cells grown in adjacent microspots containing different genetic material (to prevent cell migration and to aid spot location recognition during imaging and phenotype deconvolution) have hampered the spread of this screening technology. Here, we describe proof-of-principles experiments to circumvent these drawbacks. We have created microwell arrays on an electroporation-ready transparent substrate and established procedures to achieve highly efficient parallel introduction of exogenous molecules into human cell lines and primary mouse macrophages. The microwells confine cells and offer multiple advantages during imaging and phenotype analysis. We have also developed a simple method to load this 484-microwell array with libraries of nucleic acids using a standard microarrayer. These advances can be elaborated upon to form the basis of a miniaturized high-throughput functional genomics screening platform to carry out genome-size screens in a variety of mammalian cells that may eventually become a mainstream tool for life science research.

  2. Comparison of 3-D synthetic aperture phased-array ultrasound imaging and parallel beamforming.

    PubMed

    Rasmussen, Morten Fischer; Jensen, Jørgen Arendt

    2014-10-01

    This paper demonstrates that synthetic aperture imaging (SAI) can be used to achieve real-time 3-D ultrasound phased-array imaging. It investigates whether SAI increases the image quality compared with the parallel beamforming (PB) technique for real-time 3-D imaging. Data are obtained using both simulations and measurements with an ultrasound research scanner and a commercially available 3.5-MHz 1024-element 2-D transducer array. To limit the probe cable thickness, 256 active elements are used in transmit and receive for both techniques. The two imaging techniques were designed for cardiac imaging, which requires sequences designed for imaging down to 15 cm of depth and a frame rate of at least 20 Hz. The imaging quality of the two techniques is investigated through simulations as a function of depth and angle. SAI improved the full-width at half-maximum (FWHM) at low steering angles by 35%, and the 20-dB cystic resolution by up to 62%. The FWHM of the measured line spread function (LSF) at 80 mm depth showed a difference of 20% in favor of SAI. SAI reduced the cyst radius at 60 mm depth by 39% in measurements. SAI improved the contrast-to-noise ratio measured on anechoic cysts embedded in a tissue-mimicking material by 29% at 70 mm depth. The estimated penetration depth on the same tissue-mimicking phantom shows that SAI increased the penetration by 24% compared with PB. Neither SAI nor PB achieved the design goal of 15 cm penetration depth. This is likely due to the limited transducer surface area and the low SNR of the experimental scanner used.
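
    The FWHM figures quoted above are widths of a beam profile at half its peak value. As a small illustrative sketch (not from the paper), the following estimates the FWHM of a synthetic line spread function by linear interpolation; the Gaussian profile, sampling grid, and sigma are assumptions made for the example.

```python
# Sketch: estimating full-width at half-maximum (FWHM) from a sampled
# line spread function (LSF). The Gaussian LSF below is illustrative.
import numpy as np

def fwhm(x, y):
    """Linearly interpolated width of y at half its peak value."""
    half = y.max() / 2.0
    above = np.where(y >= half)[0]
    i0, i1 = above[0], above[-1]

    def cross(a, b):
        # interpolate the half-maximum crossing between samples a and b
        return x[a] + (half - y[a]) * (x[b] - x[a]) / (y[b] - y[a])

    left = x[i0] if i0 == 0 else cross(i0 - 1, i0)
    right = x[i1] if i1 == len(y) - 1 else cross(i1 + 1, i1)
    return right - left

x = np.linspace(-5.0, 5.0, 1001)        # lateral position (arbitrary units)
sigma = 1.0
y = np.exp(-x**2 / (2 * sigma**2))      # synthetic Gaussian LSF
w = fwhm(x, y)
print(round(w, 3))                      # close to 2*sqrt(2*ln 2)*sigma
```

    For a Gaussian, the analytic FWHM is 2√(2 ln 2)·σ ≈ 2.355σ, which the interpolated estimate reproduces closely.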

  3. Hardware multiplier processor

    DOEpatents

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions, so that in one access it can write a number and automatically perform single or double precision multiplication on it, with or without addition or subtraction of a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed, and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and a plurality of clocking pulse trains in response to the decoded address and control signals.
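
    The one-access write-and-multiply behaviour described above can be modelled in software. The sketch below is a hypothetical behavioural model, not the patented circuit: the address map (ADDR_LOAD_X, ADDR_MUL, ADDR_MAC), the register widths, and the rounding rule are invented for illustration.

```python
# Behavioural sketch of a memory-mapped multiplier: writes to distinct
# "addresses" select the operation, so a single bus access both delivers
# an operand and triggers a multiply, optionally accumulating with the
# previously stored product. All details are illustrative assumptions.
MASK32 = (1 << 32) - 1

class MemMappedMultiplier:
    ADDR_LOAD_X, ADDR_MUL, ADDR_MAC = 0x00, 0x02, 0x04  # hypothetical map

    def __init__(self):
        self.x = 0          # 16-bit multiplicand register
        self.product = 0    # 32-bit product register pair

    def write(self, addr, value):
        value &= 0xFFFF                      # 16-bit data bus
        if addr == self.ADDR_LOAD_X:
            self.x = value                   # just store an operand
        elif addr == self.ADDR_MUL:
            self.product = (self.x * value) & MASK32
        elif addr == self.ADDR_MAC:          # multiply-accumulate in one access
            self.product = (self.product + self.x * value) & MASK32

    def read_scaled(self, shift):
        # single read: round to nearest, then scale by 2**-shift (shift >= 1)
        return ((self.product + (1 << (shift - 1))) >> shift) & 0xFFFF

m = MemMappedMultiplier()
m.write(m.ADDR_LOAD_X, 300)
m.write(m.ADDR_MUL, 200)     # product = 60000
m.write(m.ADDR_LOAD_X, 100)
m.write(m.ADDR_MAC, 50)      # product = 60000 + 100*50 = 65000
print(m.product)             # 65000
print(m.read_scaled(8))      # round(65000 / 256) = 254
```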

  4. Hardware multiplier processor

    DOEpatents

    Pierce, P.E.

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions, so that in one access it can write a number and automatically perform single or double precision multiplication on it, with or without addition or subtraction of a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed, and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and a plurality of clocking pulse trains in response to the decoded address and control signals.

  5. Switching An Image Processor Between Two Computers

    NASA Technical Reports Server (NTRS)

    Bodis, Jim; Generazio, Edward R.; Stang, David B.

    1993-01-01

    Remote-control parallel switching circuit connects either of two computers (but not both simultaneously) to image processor. Includes two parallel switches actuated mechanically by solenoids. Each solenoid controlled by solid-state relay connected to remote-control line from associated computer. One or other solenoid energized to connect image processor to computer requesting access according to protocol implemented in software.

  6. Parallel Information Processing.

    ERIC Educational Resources Information Center

    Rasmussen, Edie M.

    1992-01-01

    Examines parallel computer architecture and the use of parallel processors for text. Topics discussed include parallel algorithms; performance evaluation; parallel information processing; parallel access methods for text; parallel and distributed information retrieval systems; parallel hardware for text; and network models for information…

  7. Inner-product array processor for retrieval of stored images represented by bipolar binary (+1,-1) pixels using partial input trinary pixels represented by (+1,-1,0)

    NASA Technical Reports Server (NTRS)

    Liu, Hua-Kuang (Inventor); Awwal, Abdul A. S. (Inventor); Karim, Mohammad A. (Inventor)

    1993-01-01

    An inner-product array processor is provided with thresholding of the inner product during each iteration to make more significant the inner product employed in estimating a vector to be used as the input vector for the next iteration. While stored vectors and estimated vectors are represented in bipolar binary (+1,-1), only those elements of an initial partial input vector that are believed to be common with those of a stored vector are represented in bipolar binary; the remaining elements of a partial input vector are set to 0. This mode of representation, in which the known elements of a partial input vector are in bipolar binary form and the remaining elements are set equal to 0, is referred to as trinary representation. The initial inner products corresponding to the partial input vector will then be equal to the number of known elements. Inner-product thresholding is applied to accelerate convergence and to avoid convergence to a negative inner product.
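
    A minimal numerical sketch of the recall scheme described in this abstract, assuming invented 8-pixel patterns and a winner-take-all style threshold (one plausible form of the patent's inner-product thresholding, not necessarily its exact rule):

```python
# Thresholded inner-product recall: stored vectors are bipolar (+1/-1);
# the partial input is "trinary" (known pixels bipolar, unknown pixels 0).
# Each iteration computes inner products with the stored vectors,
# thresholds them, and forms a new bipolar estimate.
import numpy as np

stored = np.array([
    [ 1,  1, -1, -1,  1, -1,  1, -1],
    [-1,  1,  1, -1, -1,  1,  1, -1],
    [ 1, -1,  1,  1, -1, -1, -1,  1],
])

def recall(partial, iters=5):
    v = partial.astype(float)
    for _ in range(iters):
        ip = stored @ v                      # inner products with stored vectors
        ip = np.where(ip < ip.max(), 0, ip)  # threshold: keep only the best match
        est = stored.T @ ip                  # weighted sum of stored vectors
        v = np.where(est >= 0, 1, -1)        # re-binarize to bipolar
    return v

# Partial input: first 4 pixels of pattern 0 known, the rest unknown (0)
partial = np.array([1, 1, -1, -1, 0, 0, 0, 0])
out = recall(partial)
print((out == stored[0]).all())              # the full stored pattern is recovered
```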

  8. Parameter allocation of parallel array bistable stochastic resonance and its application in communication systems

    NASA Astrophysics Data System (ADS)

    Liu, Jian; Wang, You-Guo; Zhai, Qi-Qing; Liu, Jin

    2016-10-01

    In this paper, we propose a parameter allocation scheme in a parallel array bistable stochastic resonance-based communication system (P-BSR-CS) to improve the performance of weak binary pulse amplitude modulated (BPAM) signal transmissions. The optimal parameter allocation policy of the P-BSR-CS is provided to minimize the bit error rate (BER) and maximize the channel capacity (CC) under the adiabatic approximation condition. On this basis, we further derive the best parameter selection theorem in realistic communication scenarios via variable transformation. Specifically, the P-BSR structure not only makes parameter selection robust, since the optimal parameter pair is not fixed but can vary over quite a wide range, but also produces outstanding system performance. Theoretical analysis and simulation results indicate that in the P-BSR-CS the proposed parameter allocation scheme yields considerable performance improvement, particularly in very low signal-to-noise ratio (SNR) environments. Project supported by the National Natural Science Foundation of China (Grant No. 61179027), the Qinglan Project of Jiangsu Province of China (Grant No. QL06212006), and the University Postgraduate Research and Innovation Project of Jiangsu Province (Grant Nos. KYLX15_0829, KYLX15_0831).
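
    The following is a minimal simulation sketch of the P-BSR idea, not the paper's optimal allocation policy: an array of bistable units, integrated by Euler-Maruyama, receives the same weak BPAM signal plus independent noise, and each bit is decoded from the sign of the averaged unit output. All parameter values (a, b, amplitude, noise level, time step) are illustrative assumptions.

```python
# Parallel array of bistable units dx/dt = a*x - b*x^3 + s(t) + noise,
# driven by a weak BPAM signal; the decision statistic is the time- and
# array-averaged state. Parameters are illustrative, not optimized.
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.0, 1.0                  # bistable potential U(x) = -a x^2/2 + b x^4/4
dt, steps_per_bit = 1e-3, 2000
amp, sigma = 0.3, 0.5            # sub-threshold signal amplitude, noise level
n_units = 16

bits = rng.integers(0, 2, 32)
symbols = 2.0 * bits - 1.0       # BPAM: bit -> +/-1

decoded = []
for s in symbols:
    x = np.zeros(n_units)        # units reset per symbol for simplicity
    acc = 0.0
    for _ in range(steps_per_bit):
        noise = sigma * np.sqrt(dt) * rng.standard_normal(n_units)
        x += dt * (a * x - b * x**3 + amp * s) + noise   # Euler-Maruyama step
        acc += x.mean()
    decoded.append(1 if acc > 0 else 0)

ber = np.mean(np.array(decoded) != bits)
print(ber)                       # low bit error rate expected in this regime
```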

  9. Development of micropump-actuated negative pressure pinched injection for parallel electrophoresis on array microfluidic chip.

    PubMed

    Li, Bowei; Jiang, Lei; Xie, Hua; Gao, Yan; Qin, Jianhua; Lin, Bingcheng

    2009-09-01

    A micropump-actuated negative pressure pinched injection method is developed for parallel electrophoresis on a multi-channel LIF detection system. The system has a home-made device that individually controls 16-port solenoid valves and a high-voltage power supply. The excitation laser beam is distributed to the array separation channels for detection. The hybrid glass-PDMS microfluidic chip comprises two common reservoirs, four separation channels coupled to their respective pneumatic micropumps, and two reference channels. Because pressure is used as the driving force, the proposed method has no sample bias effect during separation. Only one high-voltage supply is needed for separation, regardless of the number of channels, which is significant for high-throughput analysis, and the time for sample loading is shortened to 1 s. In addition, the integrated micropumps provide a versatile interface for coupling with other functional units to meet more complicated demands. The performance is verified by separation of a DNA marker and Hepatitis B virus DNA samples. The method is also expected to offer the throughput required for DNA analysis in the field of disease diagnosis.

  10. A treatment of the general volume holographic grating as an array of parallel stacked mirrors

    NASA Astrophysics Data System (ADS)

    Brotherton-Ratcliffe, D.

    2012-07-01

    An alternative model to Kogelnik's coupled wave theory of the volume holographic grating is developed in terms of an infinite array of parallel stacked mirrors. The model is based on summing the individual Fresnel reflections from an infinite number of infinitesimal discontinuities in the permittivity profile. The resulting first-order coupled partial differential equations are solved in a rotated frame of reference in order to derive analytical expressions for the diffraction efficiency of the general slanted grating at an arbitrary angle of incidence. The model has been tested using computational solutions of the Helmholtz equation for the unslanted reflection grating. For index modulations characteristic of modern silver halide and photopolymer materials used in display and optical element holography, the new model shows excellent agreement with the numerical results. Kogelnik's model also provides good agreement as long as the dephasing parameter is not too large. The model has been tested against Kogelnik's theory for a variety of cases with finite fringe slant, with good agreement for typical index modulations. A further advantage of the new model is that colour holographic gratings may be treated at and away from Bragg resonance. Numerical and analytical results are presented concerning the diffraction efficiency of two- and three-colour holographic gratings.
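
    The stacked-mirrors picture can be checked numerically with a standard thin-film transfer-matrix calculation: slice the sinusoidal index profile into thin homogeneous layers and multiply their characteristic matrices. The sketch below does this for an unslanted reflection grating at normal incidence; the index values, modulation, and thickness are illustrative assumptions, not figures from the paper.

```python
# Transfer-matrix "stack of mirrors" for an unslanted reflection grating:
# each thin slice of the sinusoidal index profile contributes a 2x2
# characteristic matrix (normal incidence). Parameters are illustrative.
import numpy as np

n0, dn = 1.5, 0.03                 # average index and modulation amplitude
lam0 = 500e-9                      # vacuum wavelength at Bragg resonance
period = lam0 / (2 * n0)           # grating period for normal-incidence Bragg
thickness = 12e-6
layers = 4000                      # thin slices across the whole grating
z = (np.arange(layers) + 0.5) * thickness / layers
n = n0 + dn * np.cos(2 * np.pi * z / period)
d = thickness / layers

def reflectance(lam):
    k0 = 2 * np.pi / lam
    M = np.eye(2, dtype=complex)
    for nj in n:
        delta = k0 * nj * d
        Mj = np.array([[np.cos(delta), 1j * np.sin(delta) / nj],
                       [1j * nj * np.sin(delta), np.cos(delta)]])
        M = M @ Mj
    eta_in = eta_out = n0          # grating embedded in an index-n0 medium
    B, C = M @ np.array([1.0, eta_out])
    r = (eta_in * B - C) / (eta_in * B + C)
    return abs(r) ** 2

R_bragg = reflectance(lam0)        # near Kogelnik's tanh^2(pi*dn*T/lam)
R_off = reflectance(lam0 * 1.05)   # detuned well off Bragg resonance
print(R_bragg, R_off)
```

    At resonance the result tracks Kogelnik's closed form tanh²(π·Δn·T/λ); away from resonance the reflectance collapses, which is the dephasing behaviour discussed in the abstract.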

  11. Classification of multimedia processors

    NASA Astrophysics Data System (ADS)

    Fatemi, Omid; Panchanathan, Sethuraman

    1998-12-01

    Multimedia system design presents challenges from the perspectives of both hardware and software. Each medium in a multimedia environment requires different processes, techniques, algorithms and hardware implementations. Multimedia processing, which necessitates real-time digital video, audio, and 3D graphics processing, is as essential a part of new systems as 2D graphics and image processing is of current systems. Multimedia applications require efficient VLSI implementations of various media processing algorithms. Emerging multimedia standards and algorithms will result in hardware systems of high complexity. In addition to recent advances in enabling VLSI technology for high density and fast circuit implementations, special investigation of architectural approaches is also required. In the past few years, multimedia hardware design has attracted considerable attention among researchers. New programmable processors, high-speed storage and modern parallelism techniques are among the variety of subjects being addressed in this domain. A detailed categorization of available multimedia processing strategies is required to help designers adapt these techniques to new architectures. Some of the important options in multimedia hardware design include: processor structure, parallelization and granularity, data distribution techniques, instruction level parallelism, memory interface and flexibility. In this paper, we address important issues in the design of a programmable multimedia processor.

  12. Systolic processor for signal processing

    SciTech Connect

    Frank, G.A.; Greenawalt, E.M.; Kulkarni, A.V.

    1982-01-01

    A systolic array is a natural architecture for a high-performance signal processor, in part because of the extensive use of inner-product operations in signal processing. The modularity and simple interconnection of systolic arrays promise to simplify the development of cost-effective, high-performance, special-purpose processors. ESL Incorporated has built a proof-of-concept model of a systolic processor. It is flexible enough to permit experimentation with a variety of algorithms and applications. ESL is exploring the application of systolic processors to image- and signal-processing problems. This paper describes this experimental system and some of its applications to signal processing. ESL is also pursuing new types of systolic architectures, including the VLSI implementation of systolic cells for solving systems of linear equations. These new systolic architectures allow the real-time design of adaptive filters. 14 references.
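
    As a toy illustration of why inner products suit systolic arrays, the following cycle-level sketch (an invented example, not ESL's design) pumps matrix rows past a line of cells that each hold one vector element and accumulate partial sums:

```python
# Cycle-level toy of a linear systolic array computing y = A @ x:
# cell j holds x[j]; partial row sums flow left to right, gaining one
# multiply-accumulate per cell, and emerge fully formed at the far end.
import numpy as np

def systolic_matvec(A, x):
    m, n = A.shape
    cells = list(x)                        # cell j holds x[j]
    acc = [0.0] * m
    pipe = [None] * n                      # (row index, partial sum) in flight
    for t in range(m + n):                 # enough cycles to drain the pipe
        out = pipe[-1]                     # last cell emits a finished row sum
        if out is not None:
            acc[out[0]] = out[1]
        for j in range(n - 1, 0, -1):      # shift partial sums one cell right
            pipe[j] = pipe[j - 1]
        pipe[0] = (t, 0.0) if t < m else None   # inject one row per cycle
        for j in range(n):                 # each cell does one MAC per cycle
            if pipe[j] is not None:
                i, s = pipe[j]
                pipe[j] = (i, s + A[i, j] * cells[j])
    return np.array(acc)

A = np.array([[1., 2., 3.], [4., 5., 6.]])
x = np.array([1., 0., -1.])
print(systolic_matvec(A, x))               # matches A @ x = [-2., -2.]
```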

  13. The digital signal processor for the ALCOR millimeter wave radar

    NASA Astrophysics Data System (ADS)

    Ford, R. A.

    1980-11-01

    This report describes the use of an array processor for real time radar signal processing. Pulse compression, range marking, and monopulse error computation are some of the functions that will be performed in the array processor for the millimeter wave ALCOR radar augmentation. Real time software design, processor architecture, and system interfaces are discussed in the report.

  14. Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment

    NASA Astrophysics Data System (ADS)

    Cao, Z.; Walsh, J. L.; Kong, M. G.

    2009-01-01

    This letter reports on the electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates, including surgical tissue forceps and a plastic plate sloped at up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than that of a comparable single jet when both are used to treat a 15° sloped substrate. These benefits likely stem from an effective self-adjustment mechanism among individual jets, facilitated by individualized ballast and spatial redistribution of surface charges.

  15. Parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers

    SciTech Connect

    Tucker, John R.; Baque, Johnathon L.; Lim, Yah Leng; Zvyagin, Andrei V.; Rakic, Aleksandar D

    2007-09-01

    In this paper we investigate the feasibility of a massively parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers (VCSELs) to measure surface profiles of displacement, distance, velocity, and liquid flow rate. The concept of the system is demonstrated using a prototype to measure the velocity at different radial points on a rotating disk, and the velocity profile of diluted milk in a custom-built diverging-converging planar flow channel. It is envisaged that a scaled-up version of the parallel self-mixing imaging system will enable real-time surface profiling, vibrometry, and flowmetry.

  16. Reconfigurable processor for a data-flow video processing system

    NASA Astrophysics Data System (ADS)

    Acosta, Edward K.; Bove, V. Michael, Jr.; Watlington, John A.; Yu, Ross A.

    1995-09-01

    The Cheops system is a compact, modular platform developed at the MIT Media Laboratory for acquisition, processing, and display of digital video sequences and model-based representations of moving scenes, and is intended as both a laboratory tool and a prototype architecture for future programmable video decoders. Rather than relying on a single general-purpose processor, Cheops abstracts out a set of basic, computationally intensive stream operations that may be performed in parallel and embodies them in specialized hardware. However, Cheops incurs a substantial performance degradation when executing operations for which no specialized processor exists. We have designed a new reconfigurable processor that combines the speed of special-purpose stream processors with the flexibility of general-purpose computing as a solution to this problem. Two SRAM-based field-programmable gate arrays are used in conjunction with a PowerPC 603 processor to provide a flexible computational substrate, which allows algorithms to be mapped to a combination of software and dedicated hardware within the data-flow paradigm. We review the Cheops system architecture, describe the hardware design of the reconfigurable processor, explain the software environment developed to allow dynamic reconfiguration of the device, and report on its performance.

  17. Parallel multispot smFRET analysis using an 8-pixel SPAD array

    PubMed Central

    Ingargiola, A.; Colyer, R. A.; Kim, D.; Panzeri, F.; Lin, R.; Gulinatti, A.; Rech, I.; Ghioni, M.; Weiss, S.; Michalet, X.

    2012-01-01

    Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for extracting distance information between two fluorophores (a donor and acceptor dye) on a nanometer scale. This method is commonly used to monitor binding interactions or intra- and intermolecular conformations in biomolecules freely diffusing through a focal volume or immobilized on a surface. The diffusing geometry has the advantages of not interfering with the molecules and of giving access to fast time scales. However, separating photon bursts from individual molecules requires low sample concentrations. This results in long acquisition times (several minutes to an hour) to obtain sufficient statistics. It also prevents studying dynamic phenomena happening on time scales larger than the burst duration and smaller than the acquisition time. Parallelization of acquisition overcomes this limit by increasing the acquisition rate while using the same low concentrations required for individual molecule burst identification. In this work we present a new two-color smFRET approach using multispot excitation and detection. The donor excitation pattern is composed of 4 spots arranged in a linear pattern. The fluorescent emission of donor and acceptor dyes is then collected and refocused on two separate areas of a custom 8-pixel SPAD array. We report smFRET measurements performed on DNA samples synthesized with various distances between the donor and acceptor fluorophores. We demonstrate that our approach provides FRET efficiency values identical to those of a conventional single-spot acquisition approach, but with a reduced acquisition time. Our work thus opens the way to high-throughput smFRET analysis on freely diffusing molecules. PMID:24382989

  18. Parallel detection of harmful algae using reverse transcription polymerase chain reaction labeling coupled with membrane-based DNA array.

    PubMed

    Zhang, Chunyun; Chen, Guofu; Ma, Chaoshuai; Wang, Yuanyuan; Zhang, Baoyu; Wang, Guangce

    2014-03-01

    Harmful algal blooms (HABs) are a global problem: they cause economic losses to the aquaculture industry and pose a potential threat to human health. More attention must be paid to the development of effective detection methods for the causative microalgae. Traditional microscopic examination has many disadvantages: it is inefficient and inaccurate, requires specialized skill in identification, and in particular cannot resolve several morphologically similar microalgae to the species level in a single parallel analysis. This study explored the feasibility of using a membrane-based DNA array for the parallel detection of several microalgae, selecting five species, Heterosigma akashiwo, Chaetoceros debilis, Skeletonema costatum, Prorocentrum donghaiense, and Nitzschia closterium, as test species. Five species-specific (taxonomic) probes were designed from variable regions of the large subunit ribosomal DNA (LSU rDNA) by visualizing the alignment of the LSU rDNA of related species. The specificity of the probes was confirmed by dot blot hybridization. The membrane-based DNA array was prepared by spotting the tailed taxonomic probes onto a positively charged nylon membrane. Digoxigenin (Dig) labeling of target molecules was performed by multiplex PCR/RT-PCR using an RNA/DNA mixture of the five microalgae as template. The Dig-labeled amplification products were hybridized with the membrane-based DNA array to produce a visible hybridization signal indicating the presence of the target algae. A detection sensitivity comparison showed that RT-PCR labeling (RPL) coupled with hybridization was tenfold more sensitive than DNA-PCR labeling coupled with hybridization. Finally, the effectiveness of RPL coupled with the membrane-based DNA array was validated by testing with simulated and natural water samples, respectively. All of these results indicate that RPL coupled with a membrane-based DNA array is specific, simple, and sensitive for the parallel detection of microalgae.

  19. Design and implementation of a high performance network security processor

    NASA Astrophysics Data System (ADS)

    Wang, Haixin; Bai, Guoqiang; Chen, Hongyi

    2010-03-01

    The last few years have seen significant progress in the field of application-specific processors. One example is network security processors (NSPs), which perform various cryptographic operations specified by network security protocols and help to offload the computation-intensive burden from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogeneous parallel crypto engine arrays lead to a Gbps-rate NSP, which is programmable with domain-specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000-based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.
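
    A sketch of the descriptor-driven fragmentation idea, with invented details: a packet is cut into fragments, each descriptor assigns a fragment to one of several parallel "engines" (modeled here by a toy XOR transform, which is not real cryptography), and the results are reassembled in sequence order.

```python
# Descriptor-based dispatch: fragment a packet, assign fragments
# round-robin to parallel engines, reassemble in sequence order.
# Fragment size, engine count, and the XOR "engine" are illustrative.
FRAG = 64                                    # bytes per fragment (assumed)
N_ENGINES = 4

def toy_engine(engine_id, fragment, key=0x5A):
    # stand-in for a crypto engine: byte-wise XOR (NOT real cryptography)
    return bytes(b ^ key for b in fragment)

def process_packet(packet):
    frags = [packet[i:i + FRAG] for i in range(0, len(packet), FRAG)]
    # descriptor: (sequence number, engine assignment) per fragment
    descriptors = [(seq, seq % N_ENGINES) for seq in range(len(frags))]
    results = {}
    for seq, eng in descriptors:             # engines would run concurrently
        results[seq] = toy_engine(eng, frags[seq])
    return b"".join(results[seq] for seq, _ in descriptors)

pkt = bytes(range(200)) * 3                  # 600-byte test packet
enc = process_packet(pkt)
dec = process_packet(enc)                    # XOR is its own inverse
print(dec == pkt)                            # True: reassembly preserves order
```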

  20. Even-odd mode excitation for stability investigation of Cartesian feedback amplifier used in parallel transmit array.

    PubMed

    Shooshtary, S; Solbach, K

    2015-08-01

    A 7 Tesla Magnetic Resonance Imaging (MRI) system with parallel transmission (pTx) using 32 near-magnet Cartesian feedback loop power amplifiers (PAs) with an output power of 1 kW is under construction at the Erwin L. Hahn Institute for Magnetic Resonance Imaging. Variation of load impedance due to mutual coupling of neighboring coils in the array may lead to instability of the Cartesian feedback loop amplifier. MRI safety requires unconditional stability of the PAs at any load. In order to avoid instability in the pTx system, the conditions and limits of stability have to be investigated for every possible excitation mode of the coil array. In this work, an efficient method of stability checking for an array of two transmit channels (Tx) with Cartesian feedback loop amplifiers and a selective excitation mode for the coil array is proposed, which allows extension of the stability investigation to a large pTx array with any arbitrary excitation mode of the coil array. PMID:26736573

  1. Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging.

    PubMed

    Pang, Yong; Yu, Baiying; Vigneron, Daniel B; Zhang, Xiaoliang

    2014-02-01

    Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using the single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than -35 dB for all the elements and that the RF fields are homogeneous, with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure using an 8-element quadrature planar patch array, demonstrating its feasibility for parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

  2. Massively parallel visualization: Parallel rendering

    SciTech Connect

    Hansen, C.D.; Krogh, M.; White, W.

    1995-12-01

    This paper presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, sphere, and volumetric data. The polygon algorithm uses a data-parallel approach, whereas the sphere and volume renderers use a MIMD approach. Implementations of these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.
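
    One common data-distribution scheme behind MPP rendering is sort-last compositing: each processor rasterizes part of the geometry into its own color and depth buffers, and the buffers are then merged by depth test. The sketch below illustrates the idea with an invented 1-D "framebuffer" and precomputed fragments; it is not the paper's CM-5 implementation.

```python
# Sort-last compositing toy: two "processors" each z-buffer their own
# fragments, then the buffers are depth-composited into the final image.
import numpy as np

W = 8                                   # 1-D framebuffer width for brevity

def render_fragment_list(frags):
    color = np.zeros(W)
    depth = np.full(W, np.inf)
    for x, z, c in frags:               # local z-buffer rasterization
        if z < depth[x]:
            depth[x], color[x] = z, c
    return color, depth

# two processors rasterize different primitives over the same screen
c0, d0 = render_fragment_list([(1, 0.5, 10), (2, 0.9, 20)])
c1, d1 = render_fragment_list([(1, 0.2, 30), (3, 0.4, 40)])

# depth compositing: keep the nearer fragment at each pixel
nearer = d1 < d0
final = np.where(nearer, c1, c0)
print(final)      # pixel 1 -> 30 (z=0.2 wins), pixel 2 -> 20, pixel 3 -> 40
```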

  3. A parallel processing VLSI BAM engine.

    PubMed

    Hasan, S R; Siong, N K

    1997-01-01

    In this paper emerging parallel/distributed architectures are explored for the digital VLSI implementation of an adaptive bidirectional associative memory (BAM) neural network. A single instruction stream many data stream (SIMD)-based parallel processing architecture is developed for the adaptive BAM neural network, taking advantage of the inherent parallelism in BAM. This novel neural processor architecture is named the sliding feeder BAM array processor (SLiFBAM). The SLiFBAM processor can be viewed as a two-stroke neural processing engine. It has four operating modes: learn pattern, evaluate pattern, read weight, and write weight. The design of a SLiFBAM VLSI processor chip is also described. Using 2-μm scalable CMOS technology, a SLiFBAM processor chip with 4+4 neurons and eight modules of 256×5 bit local weight-storage SRAM was integrated on a 6.9×7.4 mm2 prototype die. The system architecture is highly flexible and modular, enabling the construction of larger BAM networks of up to 252 neurons using multiple SLiFBAM chips.
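
    For readers unfamiliar with BAM, the sketch below shows the underlying computation such an engine parallelizes: Hebbian outer-product weights between two layers, and alternating bipolar-threshold passes (the "two strokes"). Pattern sizes and values are invented for illustration and are far smaller than the 4+4-neuron chip.

```python
# Minimal bidirectional associative memory (BAM): outer-product learning,
# then alternating forward/backward bipolar-threshold recall passes.
import numpy as np

X = np.array([[1, -1, 1, -1], [1, 1, -1, -1]])   # X-layer patterns
Y = np.array([[1, 1, -1], [-1, 1, 1]])           # associated Y-layer patterns
W = sum(np.outer(x, y) for x, y in zip(X, Y))    # Hebbian weight matrix

def bam_recall(x, passes=3):
    x = x.copy()
    for _ in range(passes):
        y = np.where(x @ W >= 0, 1, -1)          # forward stroke
        x = np.where(W @ y >= 0, 1, -1)          # backward stroke
    return x, y

x_noisy = np.array([1, -1, 1, 1])                # pattern 0 with one flipped bit
x_rec, y_rec = bam_recall(x_noisy)
print(x_rec, y_rec)                              # cleaned pair (X[0], Y[0])
```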

  4. Analog Processor To Solve Optimization Problems

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

    1993-01-01

    Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

  5. Graphene under one-dimensional periodic potentials using DNA-assembled parallel nanotubes as a periodic gate array

    NASA Astrophysics Data System (ADS)

    Wu, Yong; Han, Si-Ping; Goddard, William; Bockrath, Marc

    2015-03-01

    Graphene under an applied one-dimensional (1D) periodic potential is predicted to show many interesting and unique phenomena, such as electron supercollimation and additional Dirac points, and some progress has been made in observing graphene in this regime. Here, we use parallel nanotubes assembled using DNA linkers as a back gate to apply periodic or quasi-periodic 1D potentials to graphene layers. The pitch of the nanotube array can be controlled by the linker length, which we can vary from 8 nm to 20 nm. We can independently control the periodic potential using the nanotube array and the carrier density using a top gate to study the transport properties of the system. Our latest results will be discussed.

  6. Parallel optical coherence tomography in scattering samples using a two-dimensional smart-pixel detector array

    NASA Astrophysics Data System (ADS)

    Ducros, M.; Laubscher, M.; Karamata, B.; Bourquin, S.; Lasser, T.; Salathé, R. P.

    2002-02-01

    Parallel optical coherence tomography in scattering samples is demonstrated using a 58×58 smart-pixel detector array. A femtosecond mode-locked Ti:Sapphire laser in combination with a free space Michelson interferometer was employed to achieve 4 μm longitudinal resolution and 9 μm transverse resolution on a 260×260 μm2 field of view. We imaged a resolution target covered by an intralipid solution with different scattering coefficients as well as onion cells.

  7. Implementation and Assessment of Advanced Analog Vector-Matrix Processor

    NASA Technical Reports Server (NTRS)

    Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

    1994-01-01

    This paper discusses the design and implementation of an analog optical vector-matrix coprocessor with a throughput of 128 Mops for a personal computer. Vector-matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector-matrix processor consists of a laser diode source, an acousto-optical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector-matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low-cost, highly energy-efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components, are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation.
The sufficiency of the measured accuracy of the

  8. Processor architecture of MBAP for embedded image understanding system

    NASA Astrophysics Data System (ADS)

    Liu, Peng; Yao, Qingdong; Wu, Song; Pan, Qiaohai; Lai, JinMei

    2001-03-01

Processor architecture has a great effect on the performance of the whole processor array. To improve the performance of the SIMD array architecture, we modified the structure of the bit-serial array processor (BAP) processing element, based on the BAP128 processor. The array processor chip of the modified bit-serial array processor (MBAP), in 0.35-micrometer CMOS technology, is designed for an embedded image understanding system. This paper presents the MBAP architecture and describes its architectural features. The performance of BAP and MBAP is compared on basic macro instructions and low-level image understanding algorithms. The results show that MBAP improves substantially on BAP, at the cost of a 5% increase in chip resources.

  9. Real-Time Signal Processor for Pulsar Studies

    NASA Astrophysics Data System (ADS)

    Ramkumar, P. S.; Deshpande, A. A.

    2001-12-01

This paper describes the design, tests, and preliminary results of a real-time parallel signal processor built to aid a wide variety of pulsar observations. The signal processor reduces the distortions caused by the effects of dispersion, Faraday rotation, Doppler acceleration, and parallactic angle variations, at a sustained data rate of 32 Msamples/sec. It also folds the pulses coherently over the period and integrates adjacent samples in time and frequency to enhance the signal-to-noise ratio. The resulting data are recorded for further off-line analysis of the characteristics of pulsars and the intervening medium. The signal processing for analysis of pulsar signals is quite complex, imposing the need for a high computational throughput, typically on the order of a giga-operation per second (GOPS). Conventionally, the high computational demand restricts the flexibility to handle only a few types of pulsar observations. This instrument is designed to handle a wide variety of pulsar observations with the Giant Metrewave Radio Telescope (GMRT) and is flexible enough to be used in many other high-speed signal processing applications. The technology used includes field-programmable gate array (FPGA) based data/code routing interfaces, PC-AT based control, diagnostics and data acquisition, digital signal processor (DSP) chip based parallel processing nodes, and C-language control software with DSP-assembly programs for signal processing. The architecture and the software implementation of the parallel processor are fine-tuned to realize about 60 MOPS per DSP node and a multiple-instruction-multiple-data (MIMD) capability.
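Folding pulses coherently over the period, as mentioned above, amounts to accumulating samples into phase bins modulo the pulsar period so that the periodic pulse adds coherently while noise averages down. A minimal sketch of that one step (illustrative names and parameters; the instrument's dedispersion and DSP-assembly implementation are not reproduced here):

```python
# Fold a sampled time series at a known pulsar period: assign each sample
# to a phase bin (arrival time modulo the period) and average the bins.
# Names and parameters are illustrative.

def fold(samples, dt, period, nbins):
    profile = [0.0] * nbins
    counts = [0] * nbins
    for i, s in enumerate(samples):
        phase = (i * dt) % period / period     # pulse phase in [0, 1)
        b = int(phase * nbins) % nbins
        profile[b] += s
        counts[b] += 1
    return [p / c if c else 0.0 for p, c in zip(profile, counts)]

# A noise-free pulse train with one sample per pulse, period = 4 samples:
profile = fold([1, 0, 0, 0] * 5, dt=1.0, period=4.0, nbins=4)
# profile -> [1.0, 0.0, 0.0, 0.0]
```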

  10. Kokkos Array

    SciTech Connect

Edwards, Harold Carter; Sunderland, Daniel

    2012-09-12

    The Kokkos Array library implements shared-memory array data structures and parallel task dispatch interfaces for data-parallel computational kernels that are performance-portable to multicore-CPU and manycore-accelerator (e.g., GPGPU) devices.

  11. Parallel grid population

    SciTech Connect

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
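The two-phase scheme described above can be sketched sequentially as follows (function names and the `(object, cell)` representation are hypothetical illustrations, not the patent's code):

```python
# Sequential sketch of the two-phase parallel grid-population scheme:
# phase 1 routes objects to the grid portion that bounds them; phase 2
# lets each portion's owner insert its routed objects, so no two workers
# ever write to the same cell.

def split(seq, n):
    """Divide seq into n roughly equal contiguous chunks."""
    k, r = divmod(len(seq), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

def populate_grid(objects, n_cells, n_workers):
    """objects: list of (object_id, bounding_cell_index) pairs."""
    portions = split(list(range(n_cells)), n_workers)
    portion_of = {c: p for p, cells in enumerate(portions) for c in cells}

    # Phase 1: each worker scans its distinct set of objects and routes
    # each object to the portion that bounds it.
    inbox = [[] for _ in range(n_workers)]
    for worker_objects in split(objects, n_workers):
        for obj, cell in worker_objects:
            inbox[portion_of[cell]].append((obj, cell))

    # Phase 2: each worker owns one distinct grid portion and populates
    # it with the objects routed to it, without any locking.
    grid = {c: [] for c in range(n_cells)}
    for routed in inbox:
        for obj, cell in routed:
            grid[cell].append(obj)
    return grid
```

Because phase 2 partitions the grid (rather than the objects), each worker writes only to cells it owns, which is what makes the population step lock-free.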

  12. Disposable micro-fluidic biosensor array for online parallelized cell adhesion kinetics analysis on quartz crystal resonators

    NASA Astrophysics Data System (ADS)

    Cama, G.; Jacobs, T.; Dimaki, M. I.; Svendsen, W. E.; Hauptmann, P.; Naumann, M.

    2010-08-01

    In this contribution we present a new disposable micro-fluidic biosensor array for the online analysis of adherent Madin Darby canine kidney (MDCK-II) cells on quartz crystal resonators (QCRs). The device was conceived for the parallel cultivation of cells providing the same experimental conditions among all the sensors of the array. As well, dedicated sensor interface electronics were developed and optimized for fast spectra acquisition of all 16 QCRs with a miniaturized impedance analyzer. This allowed performing cell cultivation experiments for the observation of fast cellular reaction kinetics with focus on the comparison of the resulting sensor signals influenced by different cell distributions on the sensor surface. To prove the assumption of equal flow circulation within the symmetric micro-channel network and support the hypothesis of identical cultivation conditions for the cells living above the sensors, the influence of fabrication tolerances on the flow regime has been simulated. As well, the shear stress on the adherent cell layer due to the flowing media was characterized. Injection molding technology was chosen for the cheap mass production of disposable devices. Furthermore, the injection molding process was simulated in order to optimize the mold geometry and minimize the shrinkage and the warpage of the parts. MDCK-II cells were cultivated in the biosensor array. Parallel cultivation of cells on the gold surface of the QCRs led to first observations of the impact of the cell distribution on the sensor signals during cell cultivation. Indeed, the initial cell distribution revealed a significant influence on the changes in the measured acoustic load on the QCRs suggesting dissimilar cell migrations as well as proliferation kinetics of a non-confluent MDCK-II cell layer.

  13. Parallel acquisition of Raman spectra from a 2D multifocal array using a modulated multifocal detection scheme

    NASA Astrophysics Data System (ADS)

    Kong, Lingbo; Chan, James W.

    2015-03-01

    A major limitation of spontaneous Raman scattering is its intrinsically weak signals, which makes Raman analysis or imaging of biological specimens slow and impractical for many applications. To address this, we report the development of a novel modulated multifocal detection scheme for simultaneous acquisition of full Raman spectra from a 2-D m × n multifocal array. A spatial light modulator (SLM), or a pair of galvo-mirrors, is used to generate m × n laser foci. Raman signals generated within each focus are projected simultaneously into a spectrometer and detected by a CCD camera. The system can resolve the Raman spectra with no crosstalk along the vertical pixels of the CCD camera, e.g., along the entrance slit of the spectrometer. However, there is significant overlap of the spectra in the horizontal pixel direction, e.g., along the dispersion direction. By modulating the excitation multifocal array (illumination modulation) or the emitted Raman signal array (detection modulation), the superimposed Raman spectra of different multifocal patterns are collected. The individual Raman spectrum from each focus is then retrieved from the superimposed spectra using a postacquisition data processing algorithm. This development leads to a significant improvement in the speed of acquiring Raman spectra. We discuss the application of this detection scheme for parallel analysis of individual cells with multifocus laser tweezers Raman spectroscopy (M-LTRS) and for rapid confocal hyperspectral Raman imaging.
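One way to picture the post-acquisition retrieval step: if each acquisition applies a known on/off modulation pattern to the foci, the superimposed spectra form a linear system that can be inverted. A hedged sketch of that idea (the patterns and intensities are invented; this illustrates the concept, not the authors' exact algorithm):

```python
# If acquisition k applies a known on/off modulation pattern M[k][j] to
# focus j, the detector records y[k] = sum_j M[k][j] * s[j] at each
# wavelength pixel; with as many independent patterns as foci, the
# per-focus spectra s follow from solving the linear system.

def solve(M, y):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(M)
    A = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

M = [[1, 1],        # acquisition 1: both foci illuminated
     [1, 0]]        # acquisition 2: only focus 0 illuminated
true_spectra = [5.0, 2.0]            # per-focus intensity at one pixel
y = [sum(m * s for m, s in zip(row, true_spectra)) for row in M]
recovered = solve(M, y)              # -> [5.0, 2.0]
```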

  14. Electro-optical microwave signal processor for high-frequency wideband frequency channelization

    NASA Astrophysics Data System (ADS)

    Dawber, William N.; Webster, Ken

    1998-08-01

An electro-optic microwave signal processor for activity monitoring in an electronic warfare receiver, offering wideband operation, parallel output in real time, and 100 percent probability of intercept, is presented, along with results from a prototype system. Requirements on electronic warfare receiver systems are demanding, because they have to detect and identify potential threats across a large frequency bandwidth and in the high pulse density expected of the battlefield environment. One technique for processing signals across a wide bandwidth is to use a channelizer in the receiver front end, in order to produce a number of narrowband outputs that can be individually processed. In the presented signal processor, received microwave signals are upconverted onto an optical carrier using an electro-optic modulator and then spatially separated into a series of spots. The position and intensity of the spots are determined by the frequency and strength of the received signal(s). Finally, a photodiode array can be used for fast parallel data readout. Thus the signal processor output is fully channelized according to frequency. A prototype signal processor has been constructed, which can process microwave frequencies from 500 MHz to 8 GHz. A standard telecommunications electro-optic intensity modulator with a 3 dB bandwidth of approximately 2.5 GHz provides frequency upconversion. Readout is achieved using either a near-IR camera or a 16-element linear photodiode array.

  15. Pin-Hole Array Correlation Imaging: Highly Parallel Fluorescence Correlation Spectroscopy

    PubMed Central

    Needleman, Daniel J.; Xu, Yangqing; Mitchison, Timothy J.

    2009-01-01

In this work, we describe pin-hole array correlation imaging, a multipoint version of fluorescence correlation spectroscopy based upon a stationary Nipkow disk and a high-speed electron-multiplying charge-coupled detector. We characterize the system and test its performance on a variety of samples, including 40 nm colloids, a fluorescent protein complex, a membrane dye, and a fluorescent fusion protein. Our results demonstrate that pin-hole array correlation imaging is capable of simultaneously performing tens or hundreds of fluorescence correlation spectroscopy-style measurements in cells, with sufficient sensitivity and temporal resolution to study the behaviors of membrane-bound and soluble molecules labeled with conventional chemical dyes or fluorescent proteins. PMID:19527665

  16. TRIM46 Controls Neuronal Polarity and Axon Specification by Driving the Formation of Parallel Microtubule Arrays.

    PubMed

    van Beuningen, Sam F B; Will, Lena; Harterink, Martin; Chazeau, Anaël; van Battum, Eljo Y; Frias, Cátia P; Franker, Mariella A M; Katrukha, Eugene A; Stucchi, Riccardo; Vocking, Karin; Antunes, Ana T; Slenders, Lotte; Doulkeridou, Sofia; Sillevis Smitt, Peter; Altelaar, A F Maarten; Post, Jan A; Akhmanova, Anna; Pasterkamp, R Jeroen; Kapitein, Lukas C; de Graaff, Esther; Hoogenraad, Casper C

    2015-12-16

    Axon formation, the initial step in establishing neuronal polarity, critically depends on local microtubule reorganization and is characterized by the formation of parallel microtubule bundles. How uniform microtubule polarity is achieved during axonal development remains an outstanding question. Here, we show that the tripartite motif containing (TRIM) protein TRIM46 plays an instructive role in the initial polarization of neuronal cells. TRIM46 is specifically localized to the newly specified axon and, at later stages, partly overlaps with the axon initial segment (AIS). TRIM46 specifically forms closely spaced parallel microtubule bundles oriented with their plus-end out. Without TRIM46, all neurites have a dendrite-like mixed microtubule organization resulting in Tau missorting and altered cargo trafficking. By forming uniform microtubule bundles in the axon, TRIM46 is required for neuronal polarity and axon specification in vitro and in vivo. Thus, TRIM46 defines a unique axonal cytoskeletal compartment for regulating microtubule organization during neuronal development.

  17. Experimental Parallel-Processing Computer

    NASA Technical Reports Server (NTRS)

    Mcgregor, J. W.; Salama, M. A.

    1986-01-01

Master processor supervises slave processors, each with its own memory. Computer with parallel processing serves as inexpensive tool for experimentation with parallel mathematical algorithms. Speed enhancement obtained depends on both nature of problem and structure of algorithm used. In parallel-processing architecture, "bank select" and control signals determine which one, if any, of N slave-processor memories is accessible to master processor at any given moment. When so selected, slave memory operates as part of master computer memory. When not selected, slave memory operates independently of main memory. Slave processors communicate with each other via input/output bus.

  18. Database Reorganization in Parallel Disk Arrays with I/O Service Stealing

    NASA Technical Reports Server (NTRS)

    Zabback, Peter; Onyuksel, Ibrahim; Scheuermann, Peter; Weikum, Gerhard

    1996-01-01

    We present a model for data reorganization in parallel disk systems that is geared towards load balancing in an environment with periodic access patterns. Data reorganization is performed by disk cooling, i.e. migrating files or extents from the hottest disks to the coldest ones. We develop an approximate queueing model for determining the effective arrival rates of cooling requests and discuss its use in assessing the costs versus benefits of cooling.
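A minimal sketch of the disk-cooling migration step under stated assumptions (in-memory heat bookkeeping with invented extent names; the paper's queueing model for weighing the cost of a migration against its benefit is not reproduced here):

```python
# Disk cooling: repeatedly move the hottest extent from the hottest disk
# to the coldest disk, as long as the move strictly narrows the load gap.

def cool(disks):
    """disks: one dict per disk mapping extent name -> access heat."""
    load = lambda d: sum(d.values())
    while True:
        hot = max(disks, key=load)
        cold = min(disks, key=load)
        if not hot or hot is cold:
            break
        ext = max(hot, key=hot.get)          # hottest extent on hottest disk
        gap_before = load(hot) - load(cold)
        gap_after = abs((load(hot) - hot[ext]) - (load(cold) + hot[ext]))
        if gap_after >= gap_before:          # migration would not help
            break
        cold[ext] = hot.pop(ext)             # migrate the extent
    return disks

disks = cool([{"a": 4, "b": 4}, {"c": 1}])
# loads converge from (8, 1) to (4, 5): extent "a" migrates to disk 1
```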

  19. Critical parameters for parallel interconnects using VCSEL arrays and fiber image guides

    NASA Astrophysics Data System (ADS)

    Mukherjee, Sayan D.; Hadley, G. Ronald; Geib, Kent M.; Choquette, Kent D.; Carter, Tony R.; Fischer, Arthur J.; Robinson, Matthew; Sullivan, Charles T.

    2003-04-01

Bundles of several thousand glass optical fibers fused together are routinely used as fiber image guides for medical and other image remoting applications. Fiber image guides also offer the possibility of flexible optical interconnect links with potentially thousands of bidirectional parallel channels at data rates as high as 10 Gbps per channel, leading to aggregate data transfer rates of more than a terabit per second. A fair number of fiber-image-guide based link demonstrations using vertical cavity surface emitting lasers have been reported. However, little is known about designable parameters and optimization paradigms for applications to massively parallel optical interconnects. This paper discusses critical optical parameters that characterize a massively parallel link. Experimental characterizations were carried out to explore some of the fundamental interactions between single-mode 850 nm VCSELs and fiber image guides having different numerical apertures: 0.25, 0.55, and 1.00. Preliminary optical simulation results are given. Finally, potential directions are suggested for further experimental and analytical exploration and for applicability to designable link systems.

  20. Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems

    PubMed Central

    Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

    2011-01-01

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity, and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high-frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental-angle samples and the number of accelerometer incremental-velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows increasing the updating rate of the coning and sculling compensation, using more gyro incremental-angle and accelerometer incremental-velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform using fighter flight data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm, meeting the real-time and high-precision requirements of the system in highly dynamic environments, relative to the existing implementation on a DSP platform. PMID:22164058

  1. Top-down designs of instruction systolic arrays for polynomial interpolation and evaluation

    SciTech Connect

Schroder, H.

    1989-06-01

This paper describes the application of a new parallel architecture, the instruction systolic array (ISA), to the interpolation and evaluation of polynomials using a linear array of processors. It also demonstrates a systematic top-down design of instruction systolic arrays. The periods of the resulting algorithms are O(n) for interpolation and O(1) for evaluation, where n is the degree of the polynomial.
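The O(1) period for evaluation can be pictured with a pipelined Horner scheme on a linear array, where each processing element holds one coefficient and a new evaluation point can enter the array every step. This is a loose illustration of the systolic idea, not the paper's ISA programs:

```python
# Each PE holds one coefficient and applies one Horner step
# (p <- p * x + a_i) to the value passing through it; finished results
# emerge from the rightmost PE, one per step once the pipe is full.

def systolic_eval(coeffs, xs):
    """coeffs = [a_n, ..., a_1, a_0]; returns [p(x) for x in xs]."""
    n_pe = len(coeffs)
    pipeline = [None] * n_pe         # (x, partial result) held by each PE
    feed, results = list(xs), []
    while feed or any(cell is not None for cell in pipeline):
        exiting = pipeline[-1]       # rightmost PE emits a finished value
        if exiting is not None:
            results.append(exiting[1])
        for i in range(n_pe - 1, 0, -1):   # data shifts one PE rightward
            pipeline[i] = pipeline[i - 1]
        pipeline[0] = (feed.pop(0), 0.0) if feed else None
        for i, cell in enumerate(pipeline):    # every PE does one MAC
            if cell is not None:
                x, p = cell
                pipeline[i] = (x, p * x + coeffs[i])
    return results

# p(x) = 2x + 3 at x = 1 and x = 2:
# systolic_eval([2, 3], [1, 2]) -> [5.0, 7.0]
```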

  2. Advanced parallel processing with supercomputer architectures

    SciTech Connect

    Hwang, K.

    1987-10-01

    This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers.

  3. Optimal expression evaluation for data parallel architectures

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    A data parallel machine represents an array or other composite data structure by allocating one processor (at least conceptually) per data item. A pointwise operation can be performed between two such arrays in unit time, provided their corresponding elements are allocated in the same processors. If the arrays are not aligned in this fashion, the cost of moving one or both of them is part of the cost of the operation. The choice of where to perform the operation then affects this cost. If an expression with several operands is to be evaluated, there may be many choices of where to perform the intermediate operations. An efficient algorithm is given to find the minimum-cost way to evaluate an expression, for several different data parallel architectures. This algorithm applies to any architecture in which the metric describing the cost of moving an array is robust. This encompasses most of the common data parallel communication architectures, including meshes of arbitrary dimension and hypercubes. Remarks are made on several variations of the problem, some of which are solved and some of which remain open.
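The minimum-cost evaluation can be sketched as dynamic programming over the expression tree: for each node and each candidate alignment, keep the cheapest cost of producing the node's result there. The 1-D mesh and its distance metric below are illustrative assumptions, not the paper's general formulation:

```python
# Leaves are arrays at fixed alignments; moving an array from alignment
# i to j costs dist(i, j); an operation evaluated at j needs both
# operands moved to j. cost(node, p) = cheapest way to have the node's
# result at alignment p, computed bottom-up over the tree.

def min_eval_cost(tree, positions, dist):
    """tree: a leaf's position index, or a (left, right) pair of subtrees."""
    if not isinstance(tree, tuple):              # leaf: a stored array
        return {p: dist(tree, p) for p in positions}
    left = min_eval_cost(tree[0], positions, dist)
    right = min_eval_cost(tree[1], positions, dist)
    # evaluating the op at j costs bringing both operand results to j
    at = {j: left[j] + right[j] for j in positions}
    # the result may afterwards be moved wherever the parent needs it
    return {p: min(at[j] + dist(j, p) for j in positions) for p in positions}

positions = range(4)                             # a small 1-D mesh
dist = lambda i, j: abs(i - j)
# (array at alignment 0) + (array at alignment 3), result wanted at 0:
cost = min_eval_cost((0, 3), positions, dist)
# cost[0] -> 3: the operands must close a distance of 3 between them
```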

  4. Supercomputing on massively parallel bit-serial architectures

    NASA Technical Reports Server (NTRS)

    Iobst, Ken

    1985-01-01

    Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

  5. Opto-electronic morphological processor

    NASA Technical Reports Server (NTRS)

    Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

    1993-01-01

    The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.

  6. Parallel rendering techniques for massively parallel visualization

    SciTech Connect

    Hansen, C.; Krogh, M.; Painter, J.

    1995-07-01

As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications, rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP's abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper describes recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. It presents rendering algorithms, developed for massively parallel processors (MPPs), for polygons, spheres, and volumetric data. The polygon algorithm uses a data-parallel approach, whereas the sphere and volume renderers use a MIMD approach. Implementations of these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

  7. Design and optimization of multi-class series-parallel linear electromagnetic array artificial muscle.

    PubMed

    Li, Jing; Ji, Zhenyu; Shi, Xuetao; You, Fusheng; Fu, Feng; Liu, Ruigang; Xia, Junying; Wang, Nan; Bai, Jing; Wang, Zhanxi; Qin, Xiansheng; Dong, Xiuzhen

    2014-01-01

Skeletal muscle, with its complexity and excellent precision, has evolved over millions of years, and it offers better performance with a simpler structure than existing drive modes. Artificial muscle may therefore be designed by analyzing and imitating the properties and structure of skeletal muscle, a bionics approach that has drawn considerable research attention; a structural model of a linear electromagnetic array artificial muscle is designed in this paper. The half sarcomere is the minimum unit of the artificial muscle, and an electromagnetic model of it has been built. The structural parameters of the artificial half-sarcomere actuator were optimized to achieve better movement performance. Experimental results show that the artificial half-sarcomere actuator possesses excellent motion performance, including high response speed, large acceleration, small weight and size, and robustness, which suggests a promising application prospect.

  8. High-resolution parallel-detection sensor array using piezo-phototronics effect

    SciTech Connect

    Wang, Zhong L.; Pan, Caofeng

    2015-07-28

    A pressure sensor element includes a substrate, a first type of semiconductor material layer and an array of elongated light-emitting piezoelectric nanostructures extending upwardly from the first type of semiconductor material layer. A p-n junction is formed between each nanostructure and the first type semiconductor layer. An insulative resilient medium layer is infused around each of the elongated light-emitting piezoelectric nanostructures. A transparent planar electrode, disposed on the resilient medium layer, is electrically coupled to the top of each nanostructure. A voltage source is coupled to the first type of semiconductor material layer and the transparent planar electrode and applies a biasing voltage across each of the nanostructures. Each nanostructure emits light in an intensity that is proportional to an amount of compressive strain applied thereto.

  9. Electromagnetic energy and energy flows in photonic crystals made of arrays of parallel dielectric cylinders.

    PubMed

    Kuo, Chao-Hsien; Ye, Zhen

    2004-10-01

We consider electromagnetic propagation in two-dimensional photonic crystals, formed by parallel dielectric cylinders embedded in a uniform medium. The frequency band structure is computed using the standard plane-wave expansion method, and the corresponding eigenmodes are obtained subsequently. The optical flows of the eigenmodes are calculated by a direct computation approach, and several averaging schemes of the energy current are discussed. The results are compared to those obtained by the usual approach that employs a group velocity calculation. We consider both the case in which the frequency lies within a passing band and the situation in which the frequency is in the range of a partial band gap. The agreements and discrepancies between the various averaging schemes and the group velocity approach are discussed in detail. The results indicate that the group velocity can be obtained by an appropriate averaging method. Existing experimental methods are also discussed.

  10. Parallel image-acquisition in continuous-wave electron paramagnetic resonance imaging with a surface coil array: Proof-of-concept experiments.

    PubMed

    Enomoto, Ayano; Hirata, Hiroshi

    2014-02-01

    This article describes a feasibility study of parallel image-acquisition using a two-channel surface coil array in continuous-wave electron paramagnetic resonance (CW-EPR) imaging. Parallel EPR imaging was performed by multiplexing of EPR detection in the frequency domain. The parallel acquisition system consists of two surface coil resonators and radiofrequency (RF) bridges for EPR detection. To demonstrate the feasibility of this method of parallel image-acquisition with a surface coil array, three-dimensional EPR imaging was carried out using a tube phantom. Technical issues in the multiplexing method of EPR detection were also clarified. We found that degradation in the signal-to-noise ratio due to the interference of RF carriers is a key problem to be solved.

  11. Photorefractive processing for large adaptive phased arrays.

    PubMed

    Weverka, R T; Wagner, K; Sarto, A

    1996-03-10

An adaptive null-steering phased-array optical processor that utilizes a photorefractive crystal to time-integrate the adaptive weights and null out correlated jammers is described. This is a beam-steering processor in which the temporal waveform of the desired signal is known but the look direction is not. The processor computes the angle(s) of arrival of the desired signal and steers the array to look in that direction while rotating the nulls of the antenna pattern toward any narrow-band jammers that may be present. We have experimentally demonstrated a simplified version of this adaptive phased-array-radar processor that nulls out the narrow-band jammers by using feedback-correlation detection. In this processor it is assumed that we know a priori only that the signal is broadband and the jammers are narrow band. These are examples of a class of optical processors that use the angular selectivity of volume holograms to form the nulls and look directions in an adaptive phased-array-radar pattern and thereby harness the computational abilities of three-dimensional parallelism in the volume of photorefractive crystals. The development of this processing in volume holographic systems has led to a new algorithm for phased-array-radar processing that uses fewer tapped-delay lines than does the classic time-domain beam former. The optical implementation of the new algorithm has the further advantage of utilizing a single photorefractive crystal to implement as many as a million adaptive weights, allowing the radar system to scale to large size with no increase in processing hardware.
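The correlation-feedback nulling can be illustrated with a purely digital toy analog: an LMS-style weight update that drives the correlated jammer out of the array output. The array geometry, step size, and waveform below are invented for illustration; the optical processor forms the analogous correlations holographically rather than in code:

```python
# A 4-element half-wavelength array receives only a narrow-band jammer;
# the update w <- w - mu * conj(x) * y (with y the array output) rotates
# a null onto the jammer direction by cancelling the correlated component.

import cmath
import math

def steering(n, theta):
    """Plane-wave response of an n-element half-wavelength-spaced array."""
    return [cmath.exp(1j * math.pi * k * math.sin(theta)) for k in range(n)]

n = 4
jammer = steering(n, 0.5)                    # jammer's direction vector
w = [1.0 + 0j] * n                           # start from uniform weights
mu = 0.05
for t in range(2000):
    s = cmath.exp(1j * 0.3 * t)              # narrow-band jammer waveform
    x = [s * a for a in jammer]              # element signals
    y = sum(wi * xi for wi, xi in zip(w, x)) # array output (residual)
    w = [wi - mu * xi.conjugate() * y for wi, xi in zip(w, x)]

# the array gain toward the jammer is driven to (numerically) zero:
residual = abs(sum(wi * ai for wi, ai in zip(w, jammer)))
```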

  12. Scripts for Scalable Monitoring of Parallel Filesystem Infrastructure

    2014-02-27

Scripts for scalable monitoring of parallel filesystem infrastructure provide frameworks for monitoring the health of block storage arrays and large InfiniBand fabrics. The block storage framework uses Python multiprocessing to scale the number of monitored arrays with the number of processors in the system. This enables live monitoring of an HPC-scale filesystem with 10-50 storage arrays. For InfiniBand monitoring, scripts are included that monitor the InfiniBand health of each host, along with visualization tools for mapping the topology of complex fabrics.

  13. Scripts for Scalable Monitoring of Parallel Filesystem Infrastructure

    SciTech Connect

    Caldwell, Blake

    2014-02-27

Scripts for scalable monitoring of parallel filesystem infrastructure provide frameworks for monitoring the health of block storage arrays and large InfiniBand fabrics. The block storage framework uses Python multiprocessing to scale the number of monitored arrays with the number of processors in the system. This enables live monitoring of an HPC-scale filesystem with 10-50 storage arrays. For InfiniBand monitoring, scripts are included that monitor the InfiniBand health of each host, along with visualization tools for mapping the topology of complex fabrics.

  14. SCAN secure processor and its biometric capabilities

    NASA Astrophysics Data System (ADS)

    Kannavara, Raghudeep; Mertoguno, Sukarno; Bourbakis, Nikolaos

    2011-04-01

This paper presents the design of the SCAN secure processor and its extended instruction set to enable secure biometric authentication. The SCAN secure processor is a modified SPARC V8 processor architecture with a new instruction set to handle voice-, iris-, and fingerprint-based biometric authentication. The algorithms for processing biometric data are based on the local-global graph methodology. The biometric modules are synthesized in reconfigurable logic, and the results of the field-programmable gate array (FPGA) synthesis are presented. We propose to implement the above-mentioned modules in an off-chip FPGA co-processor. Further, the SCAN secure processor will offer SCAN-based encryption and decryption of 32-bit instructions and data.

  15. Online track processor for the CDF upgrade

    SciTech Connect

    E. J. Thomson et al.

    2002-07-17

    A trigger track processor, called the eXtremely Fast Tracker (XFT), has been designed for the CDF upgrade. This processor identifies high transverse momentum (> 1.5 GeV/c) charged particles in the new central outer tracking chamber for CDF II. The XFT design is highly parallel to handle the input rate of 183 Gbits/s and output rate of 44 Gbits/s. The processor is pipelined and reports the result for a new event every 132 ns. The processor uses three stages: hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. The complete system has been installed and commissioned at CDF II. An overview of the track processor and performance in CDF Run II are presented.

  16. Neurovision processor for designing intelligent sensors

    NASA Astrophysics Data System (ADS)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  17. Reconfigurable VLSI architecture for a database processor

    SciTech Connect

    Oflazer, K.

    1983-01-01

    This work brings together the processing potential offered by regularly structured VLSI processing units and the architecture of a database processor-the relational associative processor (RAP). The main motivations are to integrate a RAP cell processor on a few VLSI chips and improve performance by employing procedures exploiting these VLSI chips and the system level reconfigurability of processing resources. The resulting VLSI database processor consists of parallel processing cells that can be reconfigured into a large processor to execute the hard operations of projection and semijoin efficiently. It is shown that such a configuration can provide 2 to 3 orders of magnitude of performance improvement over previous implementations of the RAP system in the execution of such operations. 27 refs.

  18. Fault-tolerant computer architecture based on INMOS transputer processor

    NASA Technical Reports Server (NTRS)

    Ortiz, Jorge L.

    1987-01-01

    Redundant processing has been used for several years in mission flight systems. In these systems, more than one processor performs the same task at the same time, but only one processor is actually in real use. A fault-tolerant computer architecture based on the features provided by INMOS Transputers is presented. The Transputer architecture provides several communication links that allow data and command communication with other Transputers without the use of a bus. Additionally, the Transputer allows the use of parallel processing to increase the system speed considerably. The processor architecture consists of three processors working in parallel, keeping all the processors at the same operational level, but only one processor is in real control of the process. The design allows each Transputer to perform a test on the other two Transputers and report the operating condition of the neighboring processors. A graphic display was developed to facilitate the identification of any problem by the user.

  19. Implementing Access to Data Distributed on Many Processors

    NASA Technical Reports Server (NTRS)

    James, Mark

    2006-01-01

    A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. Since distributions have a great effect on the performance of
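The core idea of a regular block distribution can be sketched in Python. This is an illustrative sketch, not the Chapel implementation: the `BlockDist` class and its `owner`/`local_index` operators are assumed names showing how a fragmented array can still present a single point of access.

```python
# Sketch: a block distribution fragments a regular global index set
# across processors; owner/local-index operators hide the locality so
# the programmer sees one logical array.
class BlockDist:
    def __init__(self, n, num_procs):
        self.n = n
        self.p = num_procs
        self.block = -(-n // num_procs)  # ceil(n / num_procs)

    def owner(self, i):
        """Processor that holds global index i."""
        return i // self.block

    def local_index(self, i):
        """Offset of global index i within its owner's fragment."""
        return i % self.block

dist = BlockDist(n=100, num_procs=4)
# indices 0..24 land on proc 0, 25..49 on proc 1, and so on
print(dist.owner(30), dist.local_index(30))  # 1 5
```

A cyclic or block-cyclic distribution would change only these two operators, which is why factoring them out of the access path matters for load balancing.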

  20. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

    PubMed

    Cheung, Kit; Schultz, Simon R; Luk, Wayne

    2015-01-01

    NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542

  1. NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors

    PubMed Central

    Cheung, Kit; Schultz, Simon R.; Luk, Wayne

    2016-01-01

    NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542

  2. Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors

    SciTech Connect

    Aghili Yajadda, Mir Massoud

    2014-10-21

    We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal-size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of the thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV-50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to the size distribution in the networks and the irregular shape of nanoparticles. Non-Arrhenius behavior of the samples in the zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.
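The mapping reduces the network current to a sum over independent parallel resistors. A toy sketch of that summation, assuming a hypothetical sinh-type tunnel I-V characteristic for each resistor (the paper's actual junction model is more detailed):

```python
import math

def network_current(v, resistors):
    """Total current of an array of parallel nonlinear resistors at bias v.
    Each resistor is parameterized by (i0, v0) in a toy sinh-type tunnel
    characteristic I = i0 * sinh(v / v0); these parameters are
    illustrative, not taken from the paper."""
    return sum(i0 * math.sinh(v / v0) for i0, v0 in resistors)

# three hypothetical parallel resistors with different junction parameters
total = network_current(0.5, [(1e-9, 0.1), (2e-9, 0.2), (5e-10, 0.05)])
print(total)
```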

  3. Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors

    NASA Astrophysics Data System (ADS)

    Aghili Yajadda, Mir Massoud

    2014-10-01

    We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal-size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of the thermal expansion coefficient of the tunnel junction gaps on the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV-50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to the size distribution in the networks and the irregular shape of nanoparticles. Non-Arrhenius behavior of the samples in the zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

  4. Highly Parallel Computing Architectures by using Arrays of Quantum-dot Cellular Automata (QCA): Opportunities, Challenges, and Recent Results

    NASA Technical Reports Server (NTRS)

    Fijany, Amir; Toomarian, Benny N.

    2000-01-01

    There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years, and this trend may continue for the near future. However, it is a well-known fact that there are major obstacles, i.e., the physical limitation of feature size reduction and the ever increasing cost of foundries, that would prevent the long-term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA-based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum dot-based computing using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general, and QCA specifically, are intriguing to NASA due to their high packing density (10^11 - 10^12 per square cm), low power consumption (no transfer of current), and potentially higher radiation tolerance. Under the Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planar (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA

  5. Pipeline and parallel architectures for computer communication systems

    SciTech Connect

    Reddi, A.V.

    1983-01-01

    Various existing communication processor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi- and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.

  6. Parallel asynchronous systems and image processing algorithms

    NASA Technical Reports Server (NTRS)

    Coon, D. D.; Perera, A. G. U.

    1989-01-01

    A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.

  7. A cost-effective methodology for the design of massively-parallel VLSI functional units

    NASA Technical Reports Server (NTRS)

    Venkateswaran, N.; Sriram, G.; Desouza, J.

    1993-01-01

    In this paper we propose a generalized methodology for the design of cost-effective massively-parallel VLSI Functional Units. This methodology is based on a technique of generating and reducing a massive bit-array on the mask-programmable PAcube VLSI array. This methodology unifies (maintains identical data flow and control) the execution of complex arithmetic functions on PAcube arrays. It is highly regular, expandable and uniform with respect to problem-size and wordlength, thereby reducing the communication complexity. The memory-functional unit interface is regular and expandable. Using this technique, functional units of dedicated processors can be mask-programmed on the naked PAcube arrays, reducing the turn-around time. The production cost of such dedicated processors can be drastically reduced since the naked PAcube arrays can be mass-produced. Analysis of the performance of functional units designed by our method yields promising results.

  8. Quantitative analysis of RNA-protein interactions on a massively parallel array for mapping biophysical and evolutionary landscapes

    PubMed Central

    Buenrostro, Jason D.; Chircus, Lauren M.; Araya, Carlos L.; Layton, Curtis J.; Chang, Howard Y.; Snyder, Michael P.; Greenleaf, William J.

    2015-01-01

    RNA-protein interactions drive fundamental biological processes and are targets for molecular engineering, yet quantitative and comprehensive understanding of the sequence determinants of affinity remains limited. Here we repurpose a high-throughput sequencing instrument to quantitatively measure binding and dissociation of MS2 coat protein to >107 RNA targets generated on a flow-cell surface by in situ transcription and inter-molecular tethering of RNA to DNA. We decompose the binding energy contributions from primary and secondary RNA structure, finding that differences in affinity are often driven by sequence-specific changes in association rates. By analyzing the biophysical constraints and modeling mutational paths describing the molecular evolution of MS2 from low- to high-affinity hairpins, we quantify widespread molecular epistasis, and a long-hypothesized structure-dependent preference for G:U base pairs over C:A intermediates in evolutionary trajectories. Our results establish quantitative analysis of RNA on a massively parallel array (RNA-MaP) as a means of mapping biophysical and evolutionary relationships across molecular variants. PMID:24727714

  9. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    SciTech Connect

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee; Buckner, Mark A

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core

  10. FFT Computation with Systolic Arrays, A New Architecture

    NASA Technical Reports Server (NTRS)

    Boriakoff, Valentin

    1994-01-01

    The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
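The Cooley-Tukey recursion that the systolic array embodies can be sketched in a few lines of Python; each recursion level corresponds to one stage of linearly connected processors. This sketch uses the decimation-in-time form for brevity, whereas the paper's hardware implements the decimation-in-frequency case.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time Cooley-Tukey FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    # split into even- and odd-indexed subsequences, recurse
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0] * n
    for k in range(n // 2):
        # combine with twiddle factors e^{-2*pi*i*k/n}
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

print(fft([1, 0, 0, 0]))  # impulse -> flat spectrum of ones
```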

  11. Sandia secure processor : a native Java processor.

    SciTech Connect

    Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

    2003-08-01

    The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements, with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

  12. Image processing on MPP-like arrays

    SciTech Connect

    Coletti, N.B.

    1983-01-01

    The desirability and suitability of using very large arrays of processors such as the Massively Parallel Processor (MPP) for processing remotely sensed images is investigated. The dissertation can be broken into two areas. The first area is the mathematical analysis of emulating the Bitonic Sorting Network on an array of processors. This sort is useful in histogramming images that have a very large number of pixel values (or gray levels). The optimal number of routing steps required to emulate an N = 2^k × 2^k element network on a 2^n × 2^n array (k ≤ n ≤ 7), provided each processor contains one element before and after every merge sequence, is proved to be 14√N - 4log₂N - 14. Several already existing emulations achieve this lower bound. The number of elements sorted dictates a particular sorting network, and hence the number of routing steps. It is established that the cardinality N = (3/4) × 2^(2n) elements requires the absolute minimum of routing steps, 8√3·√N - 4log₂N - (20 - 4log₂3). An algorithm achieving this bound is presented. The second area covers the implementations of the image processing tasks. In particular, the histogramming of large numbers of gray levels, geometric distortion determination and its efficient correction, fast Fourier transforms, and statistical clustering are investigated.
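As an illustration of the network being emulated, here is a software sketch of bitonic sorting for power-of-two input sizes; on the MPP, each compare-exchange would map onto the processor array, and the inter-stage data movement is what the routing-step counts in the abstract bound.

```python
def bitonic_sort(a, ascending=True):
    """Sort a list whose length is a power of two via the bitonic network:
    build an ascending half and a descending half, then merge."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    first = bitonic_sort(a[:half], True)    # ascending run
    second = bitonic_sort(a[half:], False)  # descending run
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    """Merge a bitonic sequence into sorted order with compare-exchanges."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    a = list(a)
    for i in range(half):
        # compare-exchange between elements half the sequence apart
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    return bitonic_merge(a[:half], ascending) + bitonic_merge(a[half:], ascending)

print(bitonic_sort([3, 1, 4, 2]))  # [1, 2, 3, 4]
```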

  13. Dynamically scalable dual-core pipelined processor

    NASA Astrophysics Data System (ADS)

    Kumar, Nishant; Aggrawal, Ekta; Rajawat, Arvind

    2015-10-01

    This article proposes the design and architecture of a dynamically scalable dual-core pipelined processor. The design methodology is core fusion of two processors: two independent cores can dynamically morph into a larger processing unit, or they can be used as distinct processing elements to achieve high sequential performance and high parallel performance. The processor provides two execution modes. Mode 1 is a multiprogramming mode for execution of streams of instructions of lower data width, i.e., each core can perform 16-bit operations individually. Performance is improved in this mode due to the parallel execution of instructions in both cores, at the cost of area. In mode 2, both processing cores are coupled and behave like a single, high data width processing unit, i.e., they can perform 32-bit operations. Additional core-to-core communication is needed to realise this mode. The mode can switch dynamically; therefore, this processor can provide multiple functions with a single design. Design and verification of the processor have been done successfully using Verilog on the Xilinx 14.1 platform. The processor is verified in both simulation and synthesis with the help of test programs. This design is aimed at implementation on a Xilinx Spartan 3E XC3S500E FPGA.

  14. Optimal processor assignment for pipeline computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

    1991-01-01

    The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, an O(np²) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing throughput under a response time constraint is found in O(np² log p) time. Special cases of linear, independent, and tree graphs are also considered.
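For the special case of a linear pipeline of tasks executed in series, the flavor of the response-time algorithm can be sketched as a simple dynamic program over processor counts. This is an illustrative sketch, not the paper's algorithm for general series-parallel graphs; the table `t` of per-task response times is assumed to be measured experimentally, as the abstract describes.

```python
def assign(t, p):
    """Distribute p processors over a linear pipeline of tasks to
    minimize total response time.
    t: list of dicts, t[i][q] = measured response time of task i on q
    processors. Returns (best_time, processors_per_task)."""
    INF = float("inf")
    n = len(t)
    # best[j] = (minimal total time using j processors so far, allocation)
    best = [(0.0, [])] + [(INF, None)] * p
    for i in range(n):
        new = [(INF, None)] * (p + 1)
        for j in range(p + 1):
            if best[j][0] == INF:
                continue
            for q, time_iq in t[i].items():
                # give q of the remaining processors to task i
                if j + q <= p and best[j][0] + time_iq < new[j + q][0]:
                    new[j + q] = (best[j][0] + time_iq, best[j][1] + [q])
        best = new
    return min(b for b in best if b[1] is not None)

# two tasks, four processors: task 0 speeds up well, task 1 does not
tasks = [{1: 8.0, 2: 4.0, 3: 3.0}, {1: 5.0, 2: 4.5}]
print(assign(tasks, 4))  # (8.0, [3, 1])
```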

  15. Graph-Based Dynamic Assignment Of Multiple Processors

    NASA Technical Reports Server (NTRS)

    Hayes, Paul J.; Andrews, Asa M.

    1994-01-01

    Algorithm-to-architecture mapping model (ATAMM) is strategy minimizing time needed to periodically execute graphically described, data-driven application algorithm on multiple data processors. Implemented as operating system managing flow of data and dynamically assigning nodes of graph to processors. Predicts throughput versus number of processors available to execute given application algorithm. Includes rules ensuring application algorithm represented by graph is executed periodically without deadlock and in shortest possible repetition time. ATAMM proves useful in maximizing effectiveness of parallel computing systems.

  16. Magnetic arrays

    DOEpatents

    Trumper, David L.; Kim, Won-jong; Williams, Mark E.

    1997-05-20

    Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

  17. Magnetic arrays

    DOEpatents

    Trumper, D.L.; Kim, W.; Williams, M.E.

    1997-05-20

    Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

  18. Track Segment Finding with the CDFII Online Track Processor

    NASA Astrophysics Data System (ADS)

    Neu, Christopher

    2000-04-01

    With increased accelerator luminosity and detector upgrades, Run II at the Tevatron offers not only unprecedented physics opportunities, but also exciting new technical challenges. At CDF, the new Central Outer Tracker (COT) coupled with the decreased bunch spacing requires the design of a new track processor to identify tracks in the central detector. This critical component of the triggering system must be efficient, fast and accurate. The eXtremely Fast Tracker (XFT) meets these criteria. The XFT is divided into two major subsystems, the segment finder and the segment linker. We report on the XFT's role in the Level 1 triggering system at CDF and the Finder subsystem. The Finder identifies track segments within a 12-wire layer of the COT. The device is highly parallel and makes use of field programmable gate arrays. The design, testing and commissioning of the Finder are detailed.

  19. Parallel architectures for iterative methods on adaptive, block structured grids

    NASA Technical Reports Server (NTRS)

    Gannon, D.; Vanrosendale, J.

    1983-01-01

    A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one to one mapping of grids to systolic style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be no regular global structure to the grids constructed, there will still be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.

  20. Silicon on-chip 1D photonic crystal nanobeam bandstop filters for the parallel multiplexing of ultra-compact integrated sensor array.

    PubMed

    Yang, Daquan; Wang, Chuan; Ji, Yuefeng

    2016-07-25

    We propose a novel multiplexed ultra-compact high-sensitivity one-dimensional (1D) photonic crystal (PC) nanobeam cavity sensor array on a monolithic silicon chip, referred to as Parallel Integrated 1D PC Nanobeam Cavity Sensor Array (PI-1DPC-NCSA). The performance of the device is investigated numerically with three-dimensional finite-difference time-domain (3D-FDTD) technique. The PI-1DPC-NCSA consists of multiple parallel-connected channels of integrated 1D PC nanobeam cavities/waveguides with gap separations. On each channel, by connecting two additional 1D PC nanobeam bandstop filters (1DPC-NBFs) to a 1D PC nanobeam cavity sensor (1DPC-NCS) in series, a transmission spectrum with a single targeted resonance is achieved for the purpose of multiplexed sensing applications. While the other spurious resonances are filtered out by the stop-band of 1DPC-NBF, multiple 1DPC-NCSs at different resonances can be connected in parallel without spectrum overlap. Furthermore, in order for all 1DPC-NCSs to be integrated into microarrays and to be interrogated simultaneously with a single input/output port, all channels are then connected in parallel by using a 1 × n taper-type equal power splitter and a n × 1 S-type power combiner in the input port and output port, respectively (n is the channel number). The concept model of PI-1DPC-NCSA is displayed with a 3-parallel-channel 1DPC-NCSs array containing series-connected 1DPC-NBFs. The bulk refractive index sensitivities as high as 112.6nm/RIU, 121.7nm/RIU, and 148.5nm/RIU are obtained (RIU = Refractive Index Unit). In particular, the footprint of the 3-parallel-channel PI-1DPC-NCSA is 4.5μm × 50μm (width × length), decreased by more than three orders of magnitude compared to 2D PC integrated sensor arrays. Thus, this is a promising platform for realizing ultra-compact lab-on-a-chip applications with high integration density and high parallel-multiplexing capabilities. PMID:27464080

  1. Acoustooptic processor for adaptive radar noise environment characterization.

    PubMed

    Goutzoulis, A P; Casasent, D; Kumar, B V

    1984-12-01

    A new 2-D acoustooptic processor that estimates the angular as well as spectral distributions of jammers in the far field of an adaptive phased array radar is described. The operating modes of the system are discussed together with the estimation accuracy achieved. Experimental results are presented to illustrate the operation of the processor, and different acoustooptic cell operating modes are discussed.

  2. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    SciTech Connect

    Robert W. Numrich

    2008-04-22

    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard, called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a program to provide access to the co-array model through the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing.
A separate collaborative project with LANL and PNNL showed how to extend the

  3. A Double Precision High Speed Convolution Processor

    NASA Astrophysics Data System (ADS)

    Larochelle, F.; Coté, J. F.; Malowany, A. S.

    1989-11-01

    There exist several convolution processors on the market that can process images at video rate. However, none of these processors operates in floating point arithmetic. Unfortunately, many image processing algorithms presently under development are inoperable in integer arithmetic, forcing the researchers to use regular computers. To solve this problem, we designed a specialized convolution processor that operates in double precision floating point arithmetic with a throughput several thousand times faster than that obtained on a regular computer. Its high performance is attributed to a VLSI double precision convolution systolic cell designed in our laboratories. A 9×9 systolic array carries out, in a pipeline manner, every arithmetic operation. The processor is designed to interface directly with the VME Bus. A DMA chip is responsible for bringing the original pixel intensities from the memory of the computer to the systolic array and for returning the convolved pixels back to memory. A special use of 8K RAMs allows an inexpensive and efficient way of delaying the pixel intensities in order to supply the right sequence to the systolic array. On board circuitry converts pixel values into floating point representation when the image is originally represented with integer values. An additional systolic cell, used as a pipeline adder at the output of the systolic array, offers the possibility of combining images together, which allows a variable convolution window size and color image processing.
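    The systolic array above pipelines an ordinary two-dimensional convolution in double-precision floating point. As a point of reference, a direct software version of the same operation (the image, kernel, and function name here are illustrative, not from the paper) looks like this:

```python
def convolve2d(image, kernel):
    """Direct 2-D correlation-style convolution in double precision.

    Computes only the 'valid' region, where the kernel fits entirely
    inside the image. Accumulation is done in Python floats (IEEE 754
    double precision), mirroring the processor's arithmetic.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += float(image[y + j][x + i]) * kernel[j][i]
            row.append(acc)
        out.append(row)
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[0.0, 0.0], [0.0, 1.0]]  # picks the bottom-right pixel of each window
print(convolve2d(image, kernel))  # [[5.0, 6.0], [8.0, 9.0]]
```

The integer pixel values are converted to floating point on input, as the board's front-end circuitry does; the systolic array performs the same multiply-accumulate stream, but with one cell per kernel tap working in pipeline.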

  4. Processor-Group Aware Runtime Support for Shared-and Global-Address Space Models

    SciTech Connect

    Krishnan, Manoj Kumar; Tipparaju, Vinod; Palmer, Bruce; Nieplocha, Jarek

    2004-12-07

    Exploiting multilevel parallelism using processor groups is becoming increasingly important for programming on high-end systems. This paper describes group-aware run-time support for shared-/global-address space programming models. The current effort has been undertaken in the context of the Aggregate Remote Memory Copy Interface (ARMCI) [5], a portable runtime system used as a communication layer for Global Arrays [6], Co-Array Fortran (CAF) [9], GPSHMEM [10], Co-Array Python [11], and also end-user applications. The paper describes the management of shared memory, integration of shared memory communication and RDMA on clusters with SMP nodes, and registration. These are all required for efficient multi-method and multi-protocol communication on modern systems. Focus is placed on techniques for supporting process groups while maximizing communication performance and efficiently managing global memory system-wide.

  5. ALMA Correlator Real-Time Data Processor

    NASA Astrophysics Data System (ADS)

    Pisano, J.; Amestica, R.; Perez, J.

    2005-10-01

    The design of a real-time Linux application utilizing Real-Time Application Interface (RTAI) to process real-time data from the radio astronomy correlator for the Atacama Large Millimeter Array (ALMA) is described. The correlator is a custom-built digital signal processor which computes the cross-correlation function of two digitized signal streams. ALMA will have 64 antennas with 2080 signal streams each with a sample rate of 4 giga-samples per second. The correlator's aggregate data output will be 1 gigabyte per second. The software is defined by hard deadlines with high input and processing data rates, while requiring interfaces to non real-time external computers. The designed computer system, the Correlator Data Processor (CDP), consists of a cluster of 17 SMP computers, 16 of which are compute nodes plus a master controller node, all running real-time Linux kernels. Each compute node uses an RTAI kernel module to interface to a 32-bit parallel interface which accepts raw data at 64 megabytes per second in 1 megabyte chunks every 16 milliseconds. These data are transferred to tasks running on multiple CPUs in hard real-time using RTAI's LXRT facility to perform quantization corrections, data windowing, FFTs, and phase corrections for a processing rate of approximately 1 GFLOPS. Highly accurate timing signals are distributed to all seventeen computer nodes in order to synchronize them to other time-dependent devices in the observatory array. RTAI kernel tasks interface to the timing signals providing sub-millisecond timing resolution. The CDP interfaces, via the master node, to other computer systems on an external intra-net for command and control, data storage, and further data (image) processing. The master node accesses these external systems utilizing ALMA Common Software (ACS), a CORBA-based client-server software infrastructure providing logging, monitoring, data delivery, and intra-computer function invocation.
The software is being developed in tandem

  6. Rapid geodesic mapping of brain functional connectivity: implementation of a dedicated co-processor in a field-programmable gate array (FPGA) and application to resting state functional MRI.

    PubMed

    Minati, Ludovico; Cercignani, Mara; Chan, Dennis

    2013-10-01

    Graph theory-based analyses of brain network topology can be used to model the spatiotemporal correlations in neural activity detected through fMRI, and such approaches have wide-ranging potential, from detection of alterations in preclinical Alzheimer's disease through to command identification in brain-machine interfaces. However, due to prohibitive computational costs, graph-based analyses to date have principally focused on measuring connection density rather than mapping the topological architecture in full by exhaustive shortest-path determination. This paper outlines a solution to this problem through parallel implementation of Dijkstra's algorithm in programmable logic. The processor design is optimized for large, sparse graphs and provided in full as synthesizable VHDL code. An acceleration factor between 15 and 18 is obtained on a representative resting-state fMRI dataset, and maps of Euclidean path length reveal the anticipated heterogeneous cortical involvement in long-range integrative processing. These results enable high-resolution geodesic connectivity mapping for resting-state fMRI in patient populations and real-time geodesic mapping to support identification of imagined actions for fMRI-based brain-machine interfaces. PMID:23746911
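    The FPGA co-processor above parallelizes exhaustive shortest-path determination; the serial baseline it accelerates is Dijkstra's algorithm on a sparse adjacency-list graph, which can be sketched as follows (the graph, node labels, and function name are illustrative, not taken from the paper's VHDL):

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths on a sparse graph.

    adj: dict mapping node -> list of (neighbor, weight) pairs.
    Returns a dict of shortest distances from source to each reachable node.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]  # (distance, node) priority queue
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

adj = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.5)], 2: []}
print(dijkstra(adj, 0))  # {0: 0.0, 1: 1.0, 2: 2.5}
```

The heap-based formulation is efficient precisely in the large, sparse regime the processor design is optimized for; the hardware gains its factor of 15-18 by evaluating many edge relaxations concurrently rather than one at a time as here.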

  7. Rapid geodesic mapping of brain functional connectivity: implementation of a dedicated co-processor in a field-programmable gate array (FPGA) and application to resting state functional MRI.

    PubMed

    Minati, Ludovico; Cercignani, Mara; Chan, Dennis

    2013-10-01

    Graph theory-based analyses of brain network topology can be used to model the spatiotemporal correlations in neural activity detected through fMRI, and such approaches have wide-ranging potential, from detection of alterations in preclinical Alzheimer's disease through to command identification in brain-machine interfaces. However, due to prohibitive computational costs, graph-based analyses to date have principally focused on measuring connection density rather than mapping the topological architecture in full by exhaustive shortest-path determination. This paper outlines a solution to this problem through parallel implementation of Dijkstra's algorithm in programmable logic. The processor design is optimized for large, sparse graphs and provided in full as synthesizable VHDL code. An acceleration factor between 15 and 18 is obtained on a representative resting-state fMRI dataset, and maps of Euclidean path length reveal the anticipated heterogeneous cortical involvement in long-range integrative processing. These results enable high-resolution geodesic connectivity mapping for resting-state fMRI in patient populations and real-time geodesic mapping to support identification of imagined actions for fMRI-based brain-machine interfaces.

  8. Unstructured Adaptive Grid Computations on an Array of SMPs

    NASA Technical Reports Server (NTRS)

    Biswas, Rupak; Pramanick, Ira; Sohn, Andrew; Simon, Horst D.

    1996-01-01

    Dynamic load balancing is necessary for parallel adaptive methods to solve unsteady CFD problems on unstructured grids. In this paper, we present such a dynamic load balancing framework, called JOVE. Results on a four-POWERnode POWER CHALLENGEarray demonstrated that load balancing gives significant performance improvements over no load balancing for such adaptive computations. The parallel speedup of JOVE, implemented using MPI on the POWER CHALLENGEarray, was significant, being as high as 31 for 32 processors. An implementation of JOVE that exploits 'an array of SMPs' architecture was also studied; this hybrid JOVE outperformed flat JOVE by up to 28% on the meshes and adaption models tested. With large, realistic meshes and actual flow-solver and adaption phases incorporated into JOVE, hybrid JOVE can be expected to yield significant advantages over flat JOVE, especially as the number of processors is increased, thus demonstrating the scalability of an 'array of SMPs' architecture.

  9. An Experimental Digital Image Processor

    NASA Astrophysics Data System (ADS)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
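    The abstract names unsharp masking as the edge-enhancement method: the image is blurred, and a scaled difference between the original and the blur is added back, which boosts edges and characteristically overshoots at sharp transitions. A one-dimensional sketch under assumed parameters (3-tap box blur, unit gain; not Kodak's actual filter):

```python
def unsharp_mask(signal, gain=1.0):
    """Sharpen a 1-D signal: output = input + gain * (input - blurred)."""
    n = len(signal)
    blurred = []
    for i in range(n):
        left = signal[max(i - 1, 0)]    # clamp at the borders
        right = signal[min(i + 1, n - 1)]
        blurred.append((left + signal[i] + right) / 3.0)  # 3-tap box blur
    return [s + gain * (s - b) for s, b in zip(signal, blurred)]

step = [0.0, 0.0, 1.0, 1.0]  # an ideal edge
print(unsharp_mask(step))    # undershoots before the edge, overshoots after it
```

The undershoot/overshoot pair around the step is exactly the edge accentuation the pipelined hardware produces; in two dimensions the blur becomes a small 2-D kernel, but the add-back structure is the same.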

  10. Parallel programming interface for distributed data

    NASA Astrophysics Data System (ADS)

    Wang, Manhui; May, Andrew J.; Knowles, Peter J.

    2009-12-01

    The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.
    Program summary
    Program title: PPIDD
    Catalogue identifier: AEEF_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
    No. of lines in distributed program, including test data, etc.: 17 698
    No. of bytes in distributed program, including test data, etc.: 166 173
    Distribution format: tar.gz
    Programming language: Fortran, C
    Computer: Many parallel systems
    Operating system: Various
    Has the code been vectorised or parallelized?: Yes. 2-256 processors used
    RAM: 50 Mbytes
    Classification: 6.5
    External routines: Global Arrays or MPI-2
    Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access.
    Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform.
    Running time: Problem dependent.
The test provided with

  11. MPP parallel forth

    NASA Technical Reports Server (NTRS)

    Dorband, John E.

    1987-01-01

    Massively Parallel Processor (MPP) Parallel FORTH is a derivative of FORTH-83 and Unified Software Systems' Uni-FORTH. The extension of FORTH into the realm of parallel processing on the MPP is described. With few exceptions, Parallel FORTH was made to follow the description of Uni-FORTH as closely as possible. Likewise, the parallel FORTH extensions were designed to be as philosophically similar to serial FORTH as possible. The MPP hardware characteristics, as viewed by the FORTH programmer, are discussed. Then a description is presented of how parallel FORTH is implemented on the MPP.

  12. A systolic array architecture for the Applebaum-Howells array

    NASA Astrophysics Data System (ADS)

    Ueno, Motoharu; Kawabata, K.; Morooka, Tasuku

    1990-08-01

    A systolic array architecture for the Applebaum-Howells array is derived. The problem to be solved is the elimination of the global signal feedback loop in the conventional Applebaum-Howells array processor. The procedure involved in deriving the architecture consists of two steps: orthogonalization of the input element signals and elimination of the feedback loop. In the first step, the input element signals are orthogonalized with regard to each other by using the Gram-Schmidt processor, placed ahead of the Applebaum-Howells processor. It is shown in the second step that the orthogonality in the Gram-Schmidt processor output signals can remove the global signal feedback loop and that the Applebaum-Howells array can be implemented effectively by using a systolic array with regular structure and local communication. Simulation results also show that the proposed processor features desirable characteristics for the radiation pattern with low sidelobe level common to the Applebaum-Howells array.
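    The first step above relies on classical Gram-Schmidt orthogonalization: each input vector has its projections onto the previously orthogonalized vectors subtracted out. A minimal real-valued sketch (the sample vectors are illustrative; the radar processor actually operates on complex element signals):

```python
def gram_schmidt(vectors):
    """Orthogonalize a list of real vectors by classical Gram-Schmidt."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            coef = dot(w, u) / dot(u, u)  # projection coefficient onto u
            w = [wi - coef * ui for wi, ui in zip(w, u)]  # remove the component along u
        basis.append(w)
    return basis

q = gram_schmidt([[1.0, 1.0], [1.0, 0.0]])
print(q)  # [[1.0, 1.0], [0.5, -0.5]] -- the second vector is orthogonal to the first
```

Because each output vector depends only on its predecessors, the computation maps naturally onto a triangular systolic array with local communication, which is what makes the feedback-free Applebaum-Howells implementation possible.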

  13. RISC Processors and High Performance Computing

    NASA Technical Reports Server (NTRS)

    Bailey, David H.; Saini, Subhash; Craw, James M. (Technical Monitor)

    1995-01-01

    This tutorial will discuss the top five RISC microprocessors and the parallel systems in which they are used. It will provide a unique cross-machine comparison not available elsewhere. The effective performance of these processors will be compared by citing standard benchmarks in the context of real applications. The latest NAS Parallel Benchmarks, both absolute performance and performance per dollar, will be listed. The next generation of the NPB will be described. The tutorial will conclude with a discussion of future directions in the field. Technology Transfer Considerations: All of these computer systems are commercially available internationally. Information about these processors is available in the public domain, mostly from the vendors themselves. The NAS Parallel Benchmarks and their results have been previously approved numerous times for public release, beginning back in 1991.

  14. Parallel algorithms for mapping pipelined and parallel computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.

    1988-01-01

    Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm^3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm^2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
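    For a pipelined computation, the mapping problem described above amounts to partitioning m modules into n contiguous groups so that the heaviest group, the bottleneck, is as light as possible. A straightforward dynamic-programming sketch conveys the idea, though at O(nm^2) cost rather than the improved O(nm log m) bound the paper achieves (weights and names are illustrative):

```python
def min_bottleneck(weights, n):
    """Minimum achievable bottleneck when splitting a module chain into
    n contiguous groups, one group per processor."""
    m = len(weights)
    prefix = [0.0]  # prefix sums: group (j, i] weighs prefix[i] - prefix[j]
    for w in weights:
        prefix.append(prefix[-1] + w)
    INF = float("inf")
    # best[k][i]: min bottleneck assigning the first i modules to k processors
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for k in range(1, n + 1):
        for i in range(1, m + 1):
            for j in range(i):  # last group is modules (j, i]
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                best[k][i] = min(best[k][i], cost)
    return best[n][m]

print(min_bottleneck([2.0, 3.0, 4.0, 5.0], 2))  # 9.0 -> groups [2,3,4] and [5]
```

The cited algorithms reach their lower complexity by replacing this inner minimization with monotone search structures, but the objective, minimizing the heaviest contiguous block, is the same.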

  15. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  16. A novel picoliter droplet array for parallel real-time polymerase chain reaction based on double-inkjet printing.

    PubMed

    Sun, Yingnan; Zhou, Xiaoguang; Yu, Yude

    2014-09-21

    We developed and characterized a novel picoliter droplet-in-oil array generated by a double-inkjet printing method on a uniform hydrophobic silicon chip specifically designed for quantitative polymerase chain reaction (qPCR) analysis. Double-inkjet printing was proposed to efficiently address the evaporation issues of picoliter droplets during array generation on a planar substrate without the assistance of a humidifier or glycerol. The method utilizes piezoelectric inkjet printing equipment to precisely eject a reagent droplet into an oil droplet, which had first been dispensed on a hydrophobic and oleophobic substrate. No evaporation, random movement, or cross-contamination was observed during array fabrication and thermal cycling. We demonstrated the feasibility and effectiveness of this novel double-inkjet method for real-time PCR analysis. This method can readily produce multivolume droplet-in-oil arrays with volume variations ranging from picoliters to nanoliters. This feature would be useful for simultaneous multivolume PCR experiments aimed at wide and tunable dynamic ranges. These double-inkjet-based picoliter droplet arrays may have potential for multiplexed applications that require isolated containers for single-cell cultures, single molecular enzymatic assays, or digital PCR and provide an alternative option for generating droplet arrays on planar substrates without chemical patterning. PMID:25070461

  17. A novel picoliter droplet array for parallel real-time polymerase chain reaction based on double-inkjet printing.

    PubMed

    Sun, Yingnan; Zhou, Xiaoguang; Yu, Yude

    2014-09-21

    We developed and characterized a novel picoliter droplet-in-oil array generated by a double-inkjet printing method on a uniform hydrophobic silicon chip specifically designed for quantitative polymerase chain reaction (qPCR) analysis. Double-inkjet printing was proposed to efficiently address the evaporation issues of picoliter droplets during array generation on a planar substrate without the assistance of a humidifier or glycerol. The method utilizes piezoelectric inkjet printing equipment to precisely eject a reagent droplet into an oil droplet, which had first been dispensed on a hydrophobic and oleophobic substrate. No evaporation, random movement, or cross-contamination was observed during array fabrication and thermal cycling. We demonstrated the feasibility and effectiveness of this novel double-inkjet method for real-time PCR analysis. This method can readily produce multivolume droplet-in-oil arrays with volume variations ranging from picoliters to nanoliters. This feature would be useful for simultaneous multivolume PCR experiments aimed at wide and tunable dynamic ranges. These double-inkjet-based picoliter droplet arrays may have potential for multiplexed applications that require isolated containers for single-cell cultures, single molecular enzymatic assays, or digital PCR and provide an alternative option for generating droplet arrays on planar substrates without chemical patterning.

  18. NOCA-1 functions with γ-tubulin and in parallel to Patronin to assemble non-centrosomal microtubule arrays in C. elegans.

    PubMed

    Wang, Shaohe; Wu, Di; Quintin, Sophie; Green, Rebecca A; Cheerambathur, Dhanya K; Ochoa, Stacy D; Desai, Arshad; Oegema, Karen

    2015-09-15

    Non-centrosomal microtubule arrays assemble in differentiated tissues to perform mechanical and transport-based functions. In this study, we identify Caenorhabditis elegans NOCA-1 as a protein with homology to vertebrate ninein. NOCA-1 contributes to the assembly of non-centrosomal microtubule arrays in multiple tissues. In the larval epidermis, NOCA-1 functions redundantly with the minus end protection factor Patronin/PTRN-1 to assemble a circumferential microtubule array essential for worm growth and morphogenesis. Controlled degradation of a γ-tubulin complex subunit in this tissue revealed that γ-tubulin acts with NOCA-1 in parallel to Patronin/PTRN-1. In the germline, NOCA-1 and γ-tubulin co-localize at the cell surface, and inhibiting either leads to a microtubule assembly defect. γ-tubulin targets independently of NOCA-1, but NOCA-1 targeting requires γ-tubulin when a non-essential putatively palmitoylated cysteine is mutated. These results show that NOCA-1 acts with γ-tubulin to assemble non-centrosomal arrays in multiple tissues and highlight functional overlap between the ninein and Patronin protein families.

  19. NOCA-1 functions with γ-tubulin and in parallel to Patronin to assemble non-centrosomal microtubule arrays in C. elegans

    PubMed Central

    Wang, Shaohe; Wu, Di; Quintin, Sophie; Green, Rebecca A; Cheerambathur, Dhanya K; Ochoa, Stacy D; Desai, Arshad; Oegema, Karen

    2015-01-01

    Non-centrosomal microtubule arrays assemble in differentiated tissues to perform mechanical and transport-based functions. In this study, we identify Caenorhabditis elegans NOCA-1 as a protein with homology to vertebrate ninein. NOCA-1 contributes to the assembly of non-centrosomal microtubule arrays in multiple tissues. In the larval epidermis, NOCA-1 functions redundantly with the minus end protection factor Patronin/PTRN-1 to assemble a circumferential microtubule array essential for worm growth and morphogenesis. Controlled degradation of a γ-tubulin complex subunit in this tissue revealed that γ-tubulin acts with NOCA-1 in parallel to Patronin/PTRN-1. In the germline, NOCA-1 and γ-tubulin co-localize at the cell surface, and inhibiting either leads to a microtubule assembly defect. γ-tubulin targets independently of NOCA-1, but NOCA-1 targeting requires γ-tubulin when a non-essential putatively palmitoylated cysteine is mutated. These results show that NOCA-1 acts with γ-tubulin to assemble non-centrosomal arrays in multiple tissues and highlight functional overlap between the ninein and Patronin protein families. DOI: http://dx.doi.org/10.7554/eLife.08649.001 PMID:26371552

  20. Trajectory optimization using parallel shooting method on parallel computer

    SciTech Connect

    Wirthman, D.J.; Park, S.Y.; Vadali, S.R.

    1995-03-01

    The efficiency of a parallel shooting method on a parallel computer for solving a variety of optimal control guidance problems is studied. Several examples are considered to demonstrate that a speedup of nearly 7 to 1 is achieved with the use of 16 processors. It is suggested that further improvements in performance can be achieved by parallelizing in the state domain. 10 refs.

  1. Architecture and data processing alternatives for the TSE computer. Volume 3: Execution of a parallel counting algorithm using array logic (Tse) devices

    NASA Technical Reports Server (NTRS)

    Metcalfe, A. G.; Bodenheimer, R. E.

    1976-01-01

    A parallel algorithm for counting the number of logic-1 elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
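    Counting the logic-1 elements of a binary array is a parallel reduction: partial sums are combined pairwise in a tree of adder stages, log2(N) levels deep, which is the essence of the combinational structure described. A software sketch of that reduction (array contents are illustrative):

```python
def tree_count_ones(bits):
    """Count 1s by pairwise tree reduction, the way a combinational
    adder tree would: each level halves the number of partial sums."""
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-length levels with a zero
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0

print(tree_count_ones([1, 0, 1, 1, 0, 1]))  # 4
```

In hardware every level operates simultaneously on all pairs, so the count emerges after a delay proportional to the tree depth rather than to the array size.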

  2. Reconfigurable computer array: The bridge between high speed sensors and low speed computing

    SciTech Connect

    Robinson, S.H.; Caffrey, M.P.; Dunham, M.E.

    1998-06-16

    A universal limitation of RF and imaging front-end sensors is that they easily produce data at a higher rate than any general-purpose computer can continuously handle. Therefore, Los Alamos National Laboratory has developed a custom Reconfigurable Computing Array board to support a large variety of processing applications including wideband RF signals, LIDAR and multi-dimensional imaging. The board's design exploits three key features to achieve its performance. First, there are large banks of fast memory dedicated to each reconfigurable processor and also shared between pairs of processors. Second, there are dedicated data paths between processors, and from a processor to flexible I/O interfaces. Third, the design provides the ability to link multiple boards into a serial and/or parallel structure.

  3. Onboard processor technology review

    NASA Technical Reports Server (NTRS)

    Benz, Harry F.

    1990-01-01

    The general need and requirements for the onboard embedded processors necessary to control and manipulate data in spacecraft systems are discussed. The current known requirements are reviewed from a user perspective, based on current practices in the spacecraft development process. The current capabilities of available processor technologies are then discussed, and these are projected to the generation of spacecraft computers currently under identified, funded development. An appraisal is provided for the current national developmental effort.

  4. Programmable DNA-Mediated Multitasking Processor.

    PubMed

    Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-04-30

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm that employs DNA as an information-storing and -processing substrate to tackle computational problems. The massive parallelism of DNA hybridization offers the potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal-route-planning processor as a functional unit that could be embedded in contemporary navigation systems. The programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as massive data storage and simultaneous processing using far less material than conventional silicon devices.

  5. Programmable DNA-Mediated Multitasking Processor.

    PubMed

    Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-04-30

    Because of DNA's appealing features as a material, including its minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm that employs DNA as an information-storing and -processing substrate to tackle computational problems. The massive parallelism of DNA hybridization offers the potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal-route-planning processor as a functional unit that could be embedded in contemporary navigation systems. The programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as massive data storage and simultaneous processing using far less material than conventional silicon devices. PMID:25874653

  6. Parallel asynchronous hardware implementation of image processing algorithms

    NASA Technical Reports Server (NTRS)

    Coon, Darryl D.; Perera, A. G. U.

    1990-01-01

    Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing stage. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.
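
    The neuron-like pulse coding described above can be illustrated with an ideal integrate-and-fire model, in which input charge accumulates until a threshold is crossed and a pulse is emitted; the pulse rate is then linear in the input current. The parameter values below are purely illustrative, not device measurements.

```python
def pulse_times(current, threshold=1e-14, t_end=1.0):
    """Ideal integrate-and-fire pulse coder (illustrative sketch).

    Charge integrates until it reaches `threshold` coulombs, a pulse is
    emitted, and the integrator resets; the resulting pulse rate is
    current / threshold, i.e. linear in the input current.
    """
    times, charge, t = [], 0.0, 0.0
    dt = threshold / current / 100.0      # step well below one pulse period
    while t < t_end:
        charge += current * dt
        t += dt
        if charge >= threshold:
            times.append(t)
            charge -= threshold
    return times

# A 1 pA input yields ~100 pulses/s with this threshold; doubling the
# current doubles the rate.
```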

  7. Scioto: A Framework for Global-View Task Parallelism

    SciTech Connect

    Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

    2008-09-09

    We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.
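
    The load-balancing benefit of a shared task collection can be illustrated with a minimal thread-based sketch (this mimics the idea, not Scioto's actual API): workers repeatedly pull task objects from one shared queue, so faster workers automatically take on more of the work.

```python
import queue
import threading

def run_task_pool(tasks, n_workers=4):
    """Execute callable task objects drawn from one shared collection.

    Workers dynamically dequeue tasks until the queue is empty, which
    balances load when task costs are irregular.
    """
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return                      # no tasks left: worker retires
            r = task()                      # run the task object
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

squares = run_task_pool([lambda i=i: i * i for i in range(10)])
print(sorted(squares))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```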

  8. Dielectrophoresis-based programmable fluidic processors.

    PubMed

    Gascoyne, Peter R C; Vykoukal, Jody V; Schwartz, Jon A; Anderson, Thomas J; Vykoukal, Daynene M; Current, K Wayne; McConaghy, Charles; Becker, Frederick F; Andrews, Craig

    2004-08-01

    Droplet-based programmable processors promise to offer solutions to a wide range of applications in which chemical and biological analysis and/or small-scale synthesis are required, suggesting they will become the microfluidic equivalents of microprocessors by offering off-the-shelf solutions for almost any fluid-based analysis or small-scale synthesis problem. A general purpose droplet processor should be able to manipulate droplets of different compositions (including those that are electrically conductive or insulating and those of polar or non-polar nature), to control reagent titrations accurately, and to remain free of contamination and carryover on its reaction surfaces. In this article, we discuss the application of dielectrophoresis to droplet-based processors and demonstrate that it can provide the means for accurately titrating, moving and mixing polar or non-polar droplets whether they are electrically conductive or not. DEP does not require contact with control surfaces, and several strategies for minimizing surface contact are presented. As an example of a DEP-actuated general-purpose droplet processor, we show an embodiment based on a scalable CMOS architecture that uses DEP manipulation on a 32 x 32 electrode array having built-in control and switching circuitry. Lastly, we demonstrate the concept of a general-purpose programming environment that facilitates droplet software development for any type of droplet processor.

  9. A Sub 100mW H.264 MP@L4.1 Integer-Pel Motion Estimation Processor Core for MBAFF Encoding with Reconfigurable Ring-Connected Systolic Array and Segmentation-Free, Rectangle-Access Search-Window Buffer

    NASA Astrophysics Data System (ADS)

    Murachi, Yuichiro; Miyakoshi, Junichi; Hamamoto, Masaki; Iinuma, Takahiro; Ishihara, Tomokazu; Yin, Fang; Lee, Jangchung; Kawaguchi, Hiroshi; Yoshimoto, Masahiko

    We describe a sub-100-mW H.264 MP@L4.1 integer-pel motion estimation processor core for a low-power video encoder. It supports macroblock adaptive frame field (MBAFF) encoding and bidirectional prediction for a resolution of 1920×1080 pixels at 30 fps. The proposed processor features a novel hierarchical algorithm, a reconfigurable ring-connected systolic array architecture, and a segmentation-free, rectangle-access search window buffer. The hierarchical algorithm consists of a fine search and a coarse search. A complementary recursive cross search is newly introduced in the coarse search. The fine search is adaptively carried out, based on an image analysis result obtained by the coarse search. The proposed systolic array architecture minimizes the amount of transferred data and lowers computation cycles for the coarse and fine searches. In addition, we propose a novel search window buffer SRAM that has instantaneous accessibility to a rectangular area at an arbitrary location. The processor core has been designed with a 90 nm CMOS design rule. The core size is 2.5×2.5 mm². One core supports one reference frame and dissipates 48 mW at 1 V; a two-core configuration consumes 96 mW for a two-reference-frame search.
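
    The coarse/fine structure of such a search can be sketched in a few lines (this shows the generic two-level idea with a plain grid scan, not the paper's complementary recursive cross search): a sparse coarse scan locates a candidate motion vector, then a fine local refinement polishes it.

```python
def sad(ref, cur, bx, by, dx, dy, n=4):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block displaced by (dx, dy) in `ref`."""
    return sum(abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
               for y in range(n) for x in range(n))

def hierarchical_search(ref, cur, bx, by, coarse_range=8, step=4, n=4):
    """Coarse grid scan followed by a +/-1 fine refinement."""
    best = min((sad(ref, cur, bx, by, dx, dy, n), dx, dy)
               for dy in range(-coarse_range, coarse_range + 1, step)
               for dx in range(-coarse_range, coarse_range + 1, step))
    _, cx, cy = best
    best = min((sad(ref, cur, bx, by, cx + dx, cy + dy, n), cx + dx, cy + dy)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return best[1], best[2]
```

    A production search would also clamp displacements to the frame boundary and use a fine window wide enough to cover the coarse grid spacing.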

  10. Design and microfabrication of a high-aspect-ratio PDMS microbeam array for parallel nanonewton force measurement and protein printing

    NASA Astrophysics Data System (ADS)

    Sasoglu, F. M.; Bohl, A. J.; Layton, B. E.

    2007-03-01

    Cell and protein mechanics has applications ranging from cellular development to tissue engineering. Techniques such as magnetic tweezers, optical tweezers and atomic force microscopy have been used to measure cell deformation forces of the order of piconewtons to nanonewtons. In this study, an array of polymeric polydimethylsiloxane (PDMS) microbeams with diameters of 10-40 µm and lengths of 118 µm was fabricated from Sylgard® with curing agent concentrations ranging from 5% to 20%. The resulting spring constants were 100-300 nN µm-1. The elastic modulus of PDMS was determined experimentally at different curing agent concentrations and found to be 346 kPa to 704 kPa in a millimeter-scale array and ~1 MPa in a microbeam array. Additionally, the microbeam array was used to print laminin for the purpose of cell adhesion. Linear and nonlinear finite element analyses are presented and compared to the closed-form solution. The highly compliant, transparent, biocompatible PDMS may offer a method for more rapid throughput in cell and protein mechanics force measurement experiments with sensitivities necessary for highly compliant structures such as axons.
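
    The reported spring constants can be sanity-checked with the end-loaded Euler-Bernoulli cantilever formula k = 3EI/L^3, with second moment of area I = pi*d^4/64 for a circular cross section. The input values below are taken from the abstract and are illustrative only.

```python
import math

def cantilever_k(E, d, L):
    """Spring constant (N/m) of an end-loaded circular cantilever:
    k = 3*E*I/L^3 with second moment of area I = pi*d^4/64 (SI units)."""
    I = math.pi * d**4 / 64.0
    return 3.0 * E * I / L**3

# E ~ 1 MPa (microbeam array), d = 40 um, L = 118 um from the abstract;
# 1 N/m corresponds to 1000 nN/um.
k_nN_per_um = cantilever_k(1e6, 40e-6, 118e-6) * 1e3
print(round(k_nN_per_um))  # ~229, inside the reported 100-300 nN/um range
```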

  11. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, Dario B.

    1994-01-01

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

  12. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, D.B.

    1994-07-19

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

  13. Comparison of measurements and simulations of series-parallel incommensurate area superconducting quantum interference device arrays fabricated from YBa2Cu3O7-δ ion damage Josephson junctions

    NASA Astrophysics Data System (ADS)

    Cybart, Shane A.; Dalichaouch, T. N.; Wu, S. M.; Anton, S. M.; Drisko, J. A.; Parker, J. M.; Harteneck, B. D.; Dynes, R. C.

    2012-09-01

    We have fabricated series-parallel (two-dimensional) arrays of incommensurate superconducting quantum interference devices (SQUIDs) using YBa2Cu3O7-δ thin film ion damage Josephson junctions. The arrays initially consisted of a grid of Josephson junctions with 28 junctions in parallel and 565 junctions in series, for a total of 15 255 SQUIDs. The 28 junctions in the parallel direction were sequentially decreased by removing them with photolithography and ion milling to allow comparisons of voltage-magnetic field (V-B) characteristics for different parallel dimensions and area distributions. Comparisons of measurements for these different configurations reveal that the maximum voltage modulation with magnetic field is significantly reduced by both the self inductances of the SQUIDs and the mutual inductances between them. Based on these results, we develop a computer simulation model from first principles which simultaneously solves the differential equations of the junctions in the array while considering the effects of self inductance, mutual inductance, and non-uniformity of junction critical currents. We find that our model can accurately predict V-B for all of the array geometries studied. A second experiment is performed where we use photolithography and ion milling to split another 28 × 565 junction array into 6 decoupled arrays to further investigate mutual interactions between adjacent SQUIDs. This work conclusively shows that the magnetic fields generated by self currents in an incommensurate array severely reduce its performance by reducing the maximum obtainable modulation voltage.

  14. Unified optical symbolic substitution processor

    NASA Astrophysics Data System (ADS)

    Casasent, David P.

    1990-07-01

    Symbolic substitution operations can be realized optically on a correlator. This is a very attractive and efficient architecture for symbolic substitution. It allows parallel multichannel realization with a fixed set of filters (on film or easily realized on low space-bandwidth-product spatial light modulators) using space and frequency multiplexing or sequential filters. All basic logic, numeric and morphological image processing functions can be achieved by symbolic substitution. Moreover, all operations are possible on one multifunctional optical processor. Morphological operations are felt to be essential for ATR and pattern recognition preprocessing in clutter. They greatly expand the role of optics by allowing the same optical architecture to be used for low-, medium- and high-level vision.

  15. High-Speed General Purpose Genetic Algorithm Processor.

    PubMed

    Hoseini Alinodehi, Seyed Pourya; Moshfe, Sajjad; Saber Zaeimian, Masoumeh; Khoei, Abdollah; Hadidi, Khairollah

    2016-07-01

    In this paper, an ultrafast steady-state genetic algorithm processor (GAP) is presented. Due to the heavy computational load of genetic algorithms (GAs), they usually take a long time to find optimum solutions. Hardware implementation is a significant approach to overcome the problem by speeding up the GAs procedure. Hence, we designed a digital CMOS implementation of GA in [Formula: see text] process. The proposed processor is not bounded to a specific application. Indeed, it is a general-purpose processor, which is capable of performing optimization in any possible application. Utilizing speed-boosting techniques, such as pipeline scheme, parallel coarse-grained processing, parallel fitness computation, parallel selection of parents, dual-population scheme, and support for pipelined fitness computation, the proposed processor significantly reduces the processing time. Furthermore, by relying on a built-in discard operator, the proposed hardware may be used in constrained problems that are very common in control applications. In the proposed design, a large search space is achievable through the bit string length extension of individuals in the genetic population by connecting the 32-bit GAPs. In addition, the proposed processor supports parallel processing, in which the GAs procedure can be run on several connected processors simultaneously.
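
    The steady-state GA loop that such a processor accelerates can be sketched in software (an algorithmic illustration only; the function names and parameters below are not from the paper): select parents, cross over, mutate, and let the child replace, i.e. discard, the worst individual.

```python
import random

def steady_state_ga(fitness, n_bits=16, pop_size=20, iters=500, seed=1):
    """Minimal steady-state GA over bit strings."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(iters):
        # tournament selection of two parents
        p0, p1 = (max(rng.sample(pop, 3), key=fitness) for _ in range(2))
        cut = rng.randrange(1, n_bits)            # one-point crossover
        child = p0[:cut] + p1[cut:]
        child[rng.randrange(n_bits)] ^= 1         # single-bit mutation
        worst = min(range(pop_size), key=lambda j: fitness(pop[j]))
        pop[worst] = child                        # discard the least fit
    return max(pop, key=fitness)

best = steady_state_ga(sum)                       # "one-max" toy fitness
print(sum(best))
```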

  16. Comparison of simulated parallel transmit body arrays at 3 T using excitation uniformity, global SAR, local SAR and power efficiency metrics

    PubMed Central

    Guérin, Bastien; Gebhardt, Matthias; Serano, Peter; Adalsteinsson, Elfar; Hamm, Michael; Pfeuffer, Josef; Nistler, Juergen; Wald, Lawrence L.

    2014-01-01

    Purpose We compare the performance of 8 parallel transmit (pTx) body arrays with up to 32 channels and a standard birdcage design. Excitation uniformity, local SAR, global SAR and power metrics are analyzed in the torso at 3 T for RF-shimming and 2-spoke excitations. Methods We used a fast co-simulation strategy for field calculation in the presence of coupling between transmit channels. We designed spoke pulses using magnitude least squares (MLS) optimization with explicit constraint of SAR and power and compared the performance of the different pTx coils using the L-curve method. Results PTx arrays outperformed the conventional birdcage coil in all metrics except peak and average power efficiency. The presence of coupling exacerbated this power efficiency problem. At constant excitation fidelity, the pTx array with 24 channels arranged in 3 z-rows could decrease local SAR more than 4-fold (2-fold) for RF-shimming (2-spoke) compared to the birdcage coil for pulses of equal duration. Multi-row pTx coils had a marked performance advantage compared to single row designs, especially for coronal imaging. Conclusion PTx coils can simultaneously improve the excitation uniformity and reduce SAR compared to a birdcage coil when SAR metrics are explicitly constrained in the pulse design. PMID:24752979

  17. Configurable Multi-Purpose Processor

    NASA Technical Reports Server (NTRS)

    Valencia, J. Emilio; Forney, Chirstopher; Morrison, Robert; Birr, Richard

    2010-01-01

    Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, improved launch, and landing opportunities, etc. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides for a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors running the Vx-Works real-time operating system and are used to execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA s processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle s flight termination system. Consequently, the major functions of the CTP are to perform the forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, data encryption/ decryption, and initiate flight termination action commands. 
Also, the CTP had to be designed to survive and

  18. Optical cellular processor architecture. 1: Principles.

    PubMed

    Taboury, J; Wang, J M; Chavel, P; Devos, F; Garda, P

    1988-05-01

    General characteristics and advantages of 2-D optical cellular processors are listed and discussed, with reference to the concepts of cellular automata, symbolic substitution, and neural nets. The role of optical interconnections and of quasilinear processing combining linear array operations and pointwise nonlinearities is highlighted. An architecture for optical implementation of cellular automata is introduced; it features high density 3-D optical shift-invariant interconnections and programmability of the interconnection pattern through adequate use of holographic connectors.

  19. Design and parallel fabrication of wire-grid polarization arrays for polarization-resolved imaging at 1.55 microm.

    PubMed

    Zhou, Yaling; Klotzkin, David J

    2008-07-10

    Polarization-resolved imaging can provide information about the composition and topography of the environment that is invisible to the eye. We demonstrate a practical method to fabricate arrays of small, orthogonal wire-grid polarizers (WGPs) that can be matched to individual detector pixels, and we present design curves that relate the structure to the polarization extinction ratio obtained. The photonic area lithographically mapped (PALM) method uses multiple-exposure conventional and holographic lithography to create subwavelength patterns easily aligned to conventional mask features. WGPs with polarization extinction ratios of approximately 10 at a 1.55 microm wavelength were fabricated, and square centimeter areas of square micrometer size WGP arrays suitable for polarization-resolved imaging on glass were realized.

  20. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing.

    PubMed

    Park, Hansoo; Kim, Jong-Il; Ju, Young Seok; Gokcumen, Omer; Mills, Ryan E; Kim, Sheehyun; Lee, Seungbok; Suh, Dongwhan; Hong, Dongwan; Kang, Hyunseok Peter; Yoo, Yun Joo; Shin, Jong-Yeon; Kim, Hyun-Jin; Yavartanoo, Maryam; Chang, Young Wha; Ha, Jung-Sook; Chong, Wilson; Hwang, Ga-Ram; Darvishi, Katayoon; Kim, Hyeran; Yang, Song Ju; Yang, Kap-Seok; Kim, Hyungtae; Hurles, Matthew E; Scherer, Stephen W; Carter, Nigel P; Tyler-Smith, Chris; Lee, Charles; Seo, Jeong-Sun

    2010-05-01

    Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3x coverage) and two Asian genomes (AK1, with 27.8x coverage and AK2, with 32.0x coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine.
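
    The calibration step, turning relative array-CGH ratios into absolute copy numbers using a sequenced reference, reduces to simple arithmetic in the idealized case (a deliberately simplified sketch of the idea, not the paper's pipeline):

```python
def absolute_copy_number(log2_ratio, ref_copies):
    """Convert an array-CGH log2(test/reference) intensity ratio into an
    absolute copy number, given the reference genome's copy number at the
    locus (known here from whole-genome sequencing)."""
    return round(ref_copies * 2.0 ** log2_ratio)

# Against a 2-copy reference locus: ratio 0 -> 2 copies,
# ratio -1 -> 1 copy (heterozygous deletion), ratio ~0.58 -> 3 copies.
print(absolute_copy_number(-1.0, 2))  # 1
```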

  1. Gang scheduling a parallel machine

    SciTech Connect

    Gorda, B.C.; Brooks, E.D. III.

    1991-03-01

    Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time-sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time-sharing environment. Time quanta are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128-processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.
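
    The core co-scheduling idea, granting or withholding a whole gang of processors together each time quantum, can be sketched as a toy round-robin scheduler (the priorities and fair-share accounting mentioned in the abstract are omitted):

```python
from collections import deque

def gang_schedule(gangs, n_procs, quanta):
    """Round-robin gang scheduling sketch.

    `gangs` is a list of (name, processors_needed) pairs.  Each quantum,
    gangs are co-scheduled all-or-nothing onto `n_procs` processors in
    queue order; a gang that does not fit is left asleep until a later
    quantum.
    """
    ready = deque(gangs)
    schedule = []
    for _ in range(quanta):
        free, running = n_procs, []
        for _ in range(len(ready)):
            name, need = ready[0]
            if need > free:
                break                    # all-or-nothing: the gang waits
            free -= need
            running.append(name)
            ready.rotate(-1)             # it has had its turn this quantum
        schedule.append(running)
    return schedule

print(gang_schedule([("A", 2), ("B", 3), ("C", 2)], n_procs=4, quanta=3))
# [['A'], ['B'], ['C', 'A']]
```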

  2. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
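
    The detect-only behavior of the two-processor prototype is easy to illustrate: run the computation on both cores and compare; a disagreement flags an upset but cannot say which result is right, while majority voting over more copies can also correct. The functions below are a software analogy, not flight code.

```python
def lockstep(f, g, *args):
    """Dual-modular redundancy: compare two redundant computations.
    A mismatch detects (but cannot correct) a single-event upset."""
    a, b = f(*args), g(*args)
    if a != b:
        raise RuntimeError("SEU detected: redundant outputs disagree")
    return a

def vote(*results):
    """Majority vote over redundant results: masks a single faulty copy
    when at least three copies are available."""
    return max(set(results), key=results.count)

print(lockstep(lambda x: x + 1, lambda x: x + 1, 4))  # 5
print(vote(7, 7, 9))                                  # 7
```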

  3. Dedicated hardware processor and corresponding system-on-chip design for real-time laser speckle imaging.

    PubMed

    Jiang, Chao; Zhang, Hongyan; Wang, Jia; Wang, Yaru; He, Heng; Liu, Rui; Zhou, Fangyuan; Deng, Jialiang; Li, Pengcheng; Luo, Qingming

    2011-11-01

    Laser speckle imaging (LSI) is a noninvasive and full-field optical imaging technique which produces two-dimensional blood flow maps of tissues from the raw laser speckle images captured by a CCD camera without scanning. We present a hardware-friendly algorithm for the real-time processing of laser speckle imaging. The algorithm is developed and optimized specifically for LSI processing in the field programmable gate array (FPGA). Based on this algorithm, we designed a dedicated hardware processor for real-time LSI in FPGA. The pipeline processing scheme and parallel computing architecture are introduced into the design of this LSI hardware processor. When the LSI hardware processor is implemented in the FPGA running at the maximum frequency of 130 MHz, up to 85 raw images with the resolution of 640×480 pixels can be processed per second. Meanwhile, we also present a system-on-chip (SOC) solution for LSI processing by integrating the CCD controller, memory controller, LSI hardware processor, and LCD display controller into a single FPGA chip. This SOC solution also can be used to produce an application-specific integrated circuit for LSI processing.
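
    At the heart of LSI processing is the spatial speckle contrast K = stddev/mean computed over a small sliding window of the raw image; motion (blood flow) blurs the speckle and lowers K. A plain-Python version of that kernel, shown only to illustrate the computation the FPGA pipeline accelerates, looks like:

```python
import statistics

def speckle_contrast(img, win=5):
    """Spatial speckle contrast K = stddev/mean over a sliding win x win
    window; border pixels (with no full window) are left at 0."""
    h, w, r = len(img), len(img[0]), win // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):
        for x in range(r, w - r):
            vals = [img[yy][xx]
                    for yy in range(y - r, y + r + 1)
                    for xx in range(x - r, x + r + 1)]
            mean = statistics.fmean(vals)
            out[y][x] = statistics.pstdev(vals) / mean if mean else 0.0
    return out

flat = [[10] * 9 for _ in range(9)]                           # no speckle
rough = [[(x + y) % 2 * 20 for x in range(9)] for y in range(9)]
print(speckle_contrast(flat)[4][4], speckle_contrast(rough)[4][4] > 0.5)
# 0.0 True
```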

  4. Universal voice processor development

    NASA Technical Reports Server (NTRS)

    1972-01-01

    The development of a universal voice processor is discussed. The device is based on several circuit configurations using hybrid techniques to satisfy the electrical specifications. The steps taken during the design process are described. Circuit diagrams of the final design are presented. Mathematical models are included to support the theoretical aspects.

  5. Recorder/processor apparatus

    NASA Technical Reports Server (NTRS)

    Shim, I. H.; Stelben, J. J.

    1974-01-01

    Laser beam is intensity modulated in response to incoming video signals. Latent image is recorded on rotating drum which generates raster in conjunction with incrementally-driven lens carriage. Image is fed automatically to thermal processor; actual image is developed by controlled application of heat onto medium containing latent image.

  6. A digital retina-like low-level vision processor.

    PubMed

    Mertoguno, S; Bourbakis, N G

    2003-01-01

    This correspondence presents the basic design and the simulation of a low level multilayer vision processor that emulates to some degree the functional behavior of a human retina. This retina-like multilayer processor is the lower part of an autonomous self-organized vision system, called Kydon, that could be used by visually impaired people with a damaged visual cerebral cortex. The Kydon vision system, however, is not presented in this paper. The retina-like processor consists of four major layers, where each of them is an array processor based on hexagonal, autonomous processing elements that perform a certain set of low level vision tasks, such as smoothing and light adaptation, edge detection, segmentation, line recognition and region-graph generation. At each layer, the array processor is a 2D array of k×m hexagonal identical autonomous cells that simultaneously execute certain low level vision tasks. Thus, the hardware design and the simulation at the transistor level of the processing elements (PEs) of the retina-like processor and its simulated functionality with illustrative examples are provided in this paper.

  7. National Resource for Computation in Chemistry (NRCC). Attached scientific processors for chemical computations: a report to the chemistry community

    SciTech Connect

    Ostlund, N.S.

    1980-01-01

    The demands of chemists for computational resources are well known and have been amply documented. The best and most cost-effective means of providing these resources is still open to discussion, however. This report surveys the field of attached scientific processors (array processors) and attempts to indicate their present and possible future use in computational chemistry. Array processors have the possibility of providing very cost-effective computation. This report attempts to provide information that will assist chemists who might be considering the use of an array processor for their computations. It describes the general ideas and concepts involved in using array processors, the commercial products that are available, and the experiences reported by those currently using them. In surveying the field of array processors, the author makes certain recommendations regarding their use in computational chemistry. 5 figures, 1 table (RWR)

  8. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, M.S.; Strip, D.R.

    1996-01-30

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.
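The d-edge record described above can be sketched as a small data structure. Field names and the vertex format here are hypothetical illustrations, not the patent's actual layout; the sketch only shows the stated invariant that each d-edge pairs one directed edge with exactly one face.

```python
# Sketch of a directed-edge (d-edge) record: each d-edge carries the
# vertex descriptions of one edge plus a description of the single face
# it borders, so a processor can work on it without communicating.
from dataclasses import dataclass

@dataclass(frozen=True)
class DEdge:
    tail: tuple      # vertex description (coordinates); hypothetical format
    head: tuple
    face: str        # identifier of the one face this d-edge belongs to

# A unit-square face yields four d-edges, one per boundary edge:
verts = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
dedges = [DEdge(verts[i], verts[(i + 1) % 4], "f0") for i in range(4)]
print(len(dedges))  # 4
```

Because every d-edge is self-contained, each parallel processor can build and transform its own subset of records independently, which is the property the patent emphasizes.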

  9. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, Michael S.; Strip, David R.

    1996-01-01

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.

  10. Parallel image compression

    NASA Technical Reports Server (NTRS)

    Reif, John H.

    1987-01-01

    A parallel compression algorithm for the 16,384-processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.
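The vector-quantization half of the strategy can be illustrated with a toy sketch. The codebook and sample blocks below are invented for illustration and are not the paper's actual scheme:

```python
# Toy vector quantization: each 2-pixel block is replaced by the index
# of its nearest codeword under a squared-error metric.
codebook = [(0, 0), (128, 128), (255, 255)]  # hypothetical codebook

def quantize(block):
    # index of the codeword minimizing squared error against the block
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], block)))

indices = [quantize(b) for b in [(10, 5), (120, 140), (250, 255)]]
print(indices)  # [0, 1, 2]
```

In a massively parallel setting such as the MPP, one would expect each processor to quantize its own block concurrently.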

  11. Artificial intelligence in parallel

    SciTech Connect

    Waldrop, M.M.

    1984-08-10

    The current rage in the Artificial Intelligence (AI) community is parallelism: the idea is to build machines with many independent processors doing many things at once. The upshot is that about a dozen parallel machines are now under development for AI alone. As might be expected, the approaches are diverse, yet there are a number of fundamental issues in common: granularity, topology, control, and algorithms.

  12. Soft-core processor study for node-based architectures.

    SciTech Connect

    Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

    2008-09-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power, and of positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary on these nodes, with varying degrees of mission-specific performance requirements. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based processors for use in future NBA systems: two soft cores (MicroBlaze and the non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty, however: cache error mitigation is necessary when operating in a radiation environment.

  13. Reduced sensitivity algorithm for optical processors using constraints and ridge regression.

    PubMed

    Casasent, D; Ghosh, A

    1988-04-15

    Optical linear algebra processors that involve solutions of linear algebraic equations have significant potential in adaptive and inference machines. We present an algorithm that includes constraints on the accuracy of the processor and improves the accuracy of the results obtained from such analog processors. The constraint algorithm matches the problem to the accuracy of the processor. Calculation of the adaptive weights in a phased array radar is used as a case study. Simulation results demonstrate the advertised benefits. The desensitization of the calculated weights to computational errors in the processor is quantified. Ridge regression is used to determine the parameter needed in the algorithm.
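The role of the ridge parameter can be shown on a tiny worked example. This is a generic ridge-regression sketch with invented data, not the paper's constraint algorithm: the regularization term trades a small bias in the solution for reduced sensitivity to computational (processor) errors.

```python
# Ridge regression for a 2-unknown least-squares system:
#   w = (A^T A + lam*I)^(-1) A^T b
# solved in closed form for the 2x2 case (pure-Python, no libraries).
def ridge_2x2(A, b, lam):
    # normal matrix M = A^T A + lam*I and right-hand side v = A^T b
    m00 = sum(r[0] * r[0] for r in A) + lam
    m01 = sum(r[0] * r[1] for r in A)
    m11 = sum(r[1] * r[1] for r in A) + lam
    v0 = sum(r[0] * y for r, y in zip(A, b))
    v1 = sum(r[1] * y for r, y in zip(A, b))
    det = m00 * m11 - m01 * m01
    return ((m11 * v0 - m01 * v1) / det, (m00 * v1 - m01 * v0) / det)

A = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # hypothetical measurement matrix
b = [1.0, 1.0, 2.0]
exact = ridge_2x2(A, b, 0.0)    # lam = 0: ordinary least squares, (1.0, 1.0)
shrunk = ridge_2x2(A, b, 1.0)   # lam > 0: weights shrink toward zero
print(exact, shrunk)
```

Increasing `lam` shrinks the weights, which is the mechanism by which the solution becomes less sensitive to analog errors.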

  14. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio; Feng, Wu-chun

    2006-01-31

    A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval, so that each processor is informed by all of the other processors of the number of incoming jobs to be received in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
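The strobe-time global exchange can be sketched as a toy simulation. The processor count and message counts below are invented for illustration; in a real system the exchange would be a collective communication rather than a local sum:

```python
# Toy sketch of the strobe exchange in buffered coscheduling.
# During a time interval each processor only buffers how many messages
# it intends to send to each peer; nothing is transmitted yet.
P = 4
send_counts = [[0] * P for _ in range(P)]  # send_counts[i][j]: msgs i -> j
send_counts[0][2] = 3
send_counts[1][2] = 1
send_counts[3][0] = 2

# Strobe interval: a global exchange of the buffered control information.
# Afterwards every processor j knows how many incoming messages to expect
# in the next time interval (the column sum over all senders).
incoming = [sum(send_counts[i][j] for i in range(P)) for j in range(P)]
print(incoming)  # [2, 0, 4, 0]
```

Knowing the incoming counts in advance lets each processor pre-schedule its receives, which is what enables the coscheduling and fault-tolerance properties claimed.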

  15. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  16. Extended VLIW processor for real-time imaging

    NASA Astrophysics Data System (ADS)

    Sakai, Keiichi; Fujiwara, Itaru; Ae, Tadashi

    2001-04-01

    We propose EVLIW, a new processor architecture designed for general-purpose processing and especially suitable for real-time image processing. The architecture is a VLIW, but it has more functional units than a generic VLIW processor. The EVLIW consists of an interconnection network connecting neighboring units and of functional units that are more primitive than those in a generic VLIW processor. Some general-purpose processors on the market include several processing units, e.g., four single-precision floating-point or four 16-bit integer units in Intel processors with SSE/MMX, where the four units perform the same operation on four different data items. In image processing, the data are processed in parallel; the operations are not complicated, and only high-speed processing is usually required. We have tried a simple image processing task using Intel's processor with SSE/MMX and summarize the results. In this paper, we describe a new architecture for real-time imaging and its design, comparing it with Intel's processor with SSE/MMX.

  17. Stochastic propagation of an array of parallel cracks: Exploratory work on matrix fatigue damage in composite laminates

    SciTech Connect

    Williford, R.E.

    1989-09-01

    Transverse cracking of polymeric matrix materials is an important fatigue damage mechanism in continuous-fiber composite laminates. The propagation of an array of these cracks is a stochastic problem usually treated by Monte Carlo methods. However, this exploratory work proposes an alternative approach wherein the Monte Carlo method is replaced by a more closed-form recursion relation based on "fractional Brownian motion." A fractal scaling equation is also proposed as a substitute for the more empirical Paris equation describing individual crack growth in this approach. Preliminary calculations indicate that the new recursion relation is capable of reproducing the primary features of transverse matrix fatigue cracking behavior. Although not yet fully tested or verified, this recursion relation may eventually be useful for real-time applications such as monitoring damage in aircraft structures.

  18. Fabrication and Evaluation of a Micro(Bio)Sensor Array Chip for Multiple Parallel Measurements of Important Cell Biomarkers

    PubMed Central

    Pemberton, Roy M.; Cox, Timothy; Tuffin, Rachel; Drago, Guido A.; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C.; Davies, Rhodri; Jackson, Simon K.; Kenna, Gerry; Luxton, Richard; Hart, John P.

    2014-01-01

    This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

  19. Fabrication and evaluation of a micro(bio)sensor array chip for multiple parallel measurements of important cell biomarkers.

    PubMed

    Pemberton, Roy M; Cox, Timothy; Tuffin, Rachel; Drago, Guido A; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C; Davies, Rhodri; Jackson, Simon K; Kenna, Gerry; Luxton, Richard; Hart, John P

    2014-01-01

    This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

  20. Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics.

    PubMed

    Shao, Yiping; Sun, Xishan; Lan, Kejian A; Bircher, Chad; Lou, Kai; Deng, Zhi

    2014-03-01

    In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays to measure the depth of interaction (DOI) and improve imaging performance. Each PET detector has an 8 × 8 array of about 1.9 × 1.9 × 30.0 mm(3) lutetium-yttrium-oxyorthosilicate scintillators, with each end optically connected to an SSPM array (16 channels in a 4 × 4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3 × 3 mm(2), and its output is read by a custom-developed application-specific integrated circuit to directly convert analogue signals to digital timing pulses that encode the interaction information. These pulses are transferred to and are decoded by a field-programmable gate array-based time-to-digital convertor for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable the use of flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered subset expectation maximization image reconstruction was implemented. The measured mean energy, coincidence timing and DOI resolution for a crystal were about 17.6%, 2.8 ns and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI. This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI

  1. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors
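The processor-division step in the nested-parallelism scheme can be sketched as follows. This shows one plausible policy (an even split with the remainder spread over the first tasks); PM's actual allocation algorithm may differ:

```python
# When a parallel statement creates fewer new tasks than there are
# processors, the available processors are divided among the tasks.
# Sketch of one plausible split policy (illustrative, not PM's own).
def divide_processors(nprocs, ntasks):
    share, extra = divmod(nprocs, ntasks)
    return [share + 1 if t < extra else share for t in range(ntasks)]

print(divide_processors(10, 4))  # [3, 3, 2, 2]
```

Each task then owns its share of the abstract processor array; in the opposite case (more tasks than processors), tasks would instead be multiplexed onto processors, e.g. by work stealing as the abstract notes.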

  2. Radiofrequency current source (RFCS) drive and decoupling technique for parallel transmit arrays using a high-power metal oxide semiconductor field-effect transistor (MOSFET).

    PubMed

    Lee, Wonje; Boskamp, Eddy; Grist, Thomas; Kurpad, Krishna

    2009-07-01

    A radiofrequency current source (RFCS) design using a high-power metal oxide semiconductor field effect transistor (MOSFET) that enables independent current control for parallel transmit applications is presented. The design of an RFCS integrated with a series tuned transmitting loop and its associated control circuitry is described. The current source is operated in a gated class AB push-pull configuration for linear operation at high efficiency. The pulsed RF current amplitude driven into the low impedance transmitting loop was found to be relatively insensitive to the various loaded loop impedances ranging from 0.4 to 10.3 ohms, confirming current mode operation. The suppression of current induced by a neighboring loop was quantified as a function of center-to-center loop distance, and was measured to be 17 dB for nonoverlapping, adjacent loops. Deterministic manipulation of the B(1) field pattern was demonstrated by the independent control of RF phase and amplitude in a head-sized two-channel volume transmit array. It was found that a high-voltage-rated RF power MOSFET with a minimum load resistance exhibits current source behavior, which aids in transmit array design.

  3. Radiofrequency current source (RFCS) drive and decoupling technique for parallel transmit arrays using a high-power metal oxide semiconductor field-effect transistor (MOSFET).

    PubMed

    Lee, Wonje; Boskamp, Eddy; Grist, Thomas; Kurpad, Krishna

    2009-07-01

    A radiofrequency current source (RFCS) design using a high-power metal oxide semiconductor field effect transistor (MOSFET) that enables independent current control for parallel transmit applications is presented. The design of an RFCS integrated with a series tuned transmitting loop and its associated control circuitry is described. The current source is operated in a gated class AB push-pull configuration for linear operation at high efficiency. The pulsed RF current amplitude driven into the low impedance transmitting loop was found to be relatively insensitive to the various loaded loop impedances ranging from 0.4 to 10.3 ohms, confirming current mode operation. The suppression of current induced by a neighboring loop was quantified as a function of center-to-center loop distance, and was measured to be 17 dB for nonoverlapping, adjacent loops. Deterministic manipulation of the B(1) field pattern was demonstrated by the independent control of RF phase and amplitude in a head-sized two-channel volume transmit array. It was found that a high-voltage-rated RF power MOSFET with a minimum load resistance exhibits current source behavior, which aids in transmit array design. PMID:19353658

  4. Rapid, single-molecule assays in nano/micro-fluidic chips with arrays of closely spaced parallel channels fabricated by femtosecond laser machining.

    PubMed

    Canfield, Brian K; King, Jason K; Robinson, William N; Hofmeister, William H; Davis, Lloyd M

    2014-01-01

    Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

  5. Rapid, single-molecule assays in nano/micro-fluidic chips with arrays of closely spaced parallel channels fabricated by femtosecond laser machining.

    PubMed

    Canfield, Brian K; King, Jason K; Robinson, William N; Hofmeister, William H; Davis, Lloyd M

    2014-08-20

    Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values.

  6. On the design, analysis, and implementation of efficient parallel algorithms

    SciTech Connect

    Sohn, S.M.

    1989-01-01

    There is considerable interest in developing algorithms for a variety of parallel computer architectures. This is not a trivial problem, although for certain models great progress has been made. Recently, general-purpose parallel machines have become available commercially. These machines possess widely varying interconnection topologies and data/instruction access schemes. It is important, therefore, to develop methodologies and design paradigms not only for synthesizing parallel algorithms from initial problem specifications, but also for mapping algorithms between different architectures. This work has considered both of these problems. A systolic array consists of a large collection of simple processors that are interconnected in a uniform pattern. The author has studied in detail the problem of mapping systolic algorithms onto more general-purpose parallel architectures such as the hypercube. The hypercube architecture is notable due to its symmetry and high connectivity, characteristics which are conducive to the efficient embedding of parallel algorithms. Although the parallel-to-parallel mapping techniques have yielded efficient target algorithms, it is not surprising that an algorithm designed directly for a particular parallel model would achieve superior performance. In this context, the author has developed hypercube algorithms for some important problems in speech and signal processing, text processing, language processing and artificial intelligence. These algorithms were implemented on a 64-node NCUBE/7 hypercube machine in order to evaluate their performance.
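A standard device for this kind of systolic-to-hypercube mapping (a textbook technique, not necessarily the exact construction used in the thesis) is the reflected binary Gray code, which embeds a linear array into a hypercube so that array neighbours stay hypercube neighbours:

```python
# Reflected binary Gray code: consecutive values differ in exactly one bit.
def gray(i):
    return i ^ (i >> 1)

# Embed an 8-node linear systolic array into a 3-dimensional hypercube:
# array node j maps to hypercube vertex gray(j), so adjacent array nodes
# land on adjacent hypercube vertices (Hamming distance 1).
mapping = [gray(j) for j in range(8)]
print(mapping)  # [0, 1, 3, 2, 6, 7, 5, 4]
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(mapping, mapping[1:]))
```

This dilation-1 embedding is one reason the hypercube's symmetry and connectivity make it a convenient target for systolic algorithms.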

  7. Efficiency of parallel direct optimization

    NASA Technical Reports Server (NTRS)

    Janies, D. A.; Wheeler, W. C.

    2001-01-01

    Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. © 2001 The Willi Hennig Society.
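The efficiency and speed-up figures discussed above follow the usual definitions; the timings below are invented purely to show the arithmetic:

```python
# Standard definitions: speedup S = T1 / Tp, parallel efficiency E = S / p.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    return speedup(t_serial, t_parallel) / nprocs

# Hypothetical example: a tree search taking 1600 s sequentially and
# 125 s on 16 slave processors.
s = speedup(1600.0, 125.0)         # 12.8x
e = efficiency(1600.0, 125.0, 16)  # 0.8 (80% of ideal linear scaling)
print(s, e)
```

An algorithm like branch swapping that shows "no appreciable speed-up" beyond 16 processors has efficiency falling roughly as 1/p past that point, since S stays flat while p grows.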

  8. RISC Processors and High Performance Computing

    NASA Technical Reports Server (NTRS)

    Saini, Subhash; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

    1995-01-01

    In this tutorial, we will discuss the top five current RISC microprocessors: the IBM Power2, which is used in the IBM RS6000/590 workstation and in the IBM SP2 parallel supercomputer; the DEC Alpha, which is in the DEC Alpha workstation and in the Cray T3D; the MIPS R8000, which is used in the SGI Power Challenge; the HP PA-RISC 7100, which is used in the HP 700 series workstations and in the Convex Exemplar; and the Cray proprietary processor, which is used in the new Cray J916. The architecture of these microprocessors will first be presented. The effective performance of these processors will then be compared, both by citing standard benchmarks and also in the context of implementing real applications. In the process, different programming models such as data parallel (CM Fortran and HPF) and message passing (PVM and MPI) will be introduced and compared. The latest NAS Parallel Benchmark (NPB) absolute performance and performance per dollar figures will be presented. The next generation of the NPB will also be described. The tutorial will conclude with a discussion of general trends in the field of high performance computing, including likely future developments in hardware and software technology, and the relative roles of vector supercomputers, tightly coupled parallel computers, and clusters of workstations. This tutorial will provide a unique cross-machine comparison not available elsewhere.

  9. Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging

    NASA Astrophysics Data System (ADS)

    El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

    2014-01-01

    A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10(5) and 10(6) for the PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10(3) through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans.

  10. Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging.

    PubMed

    El-Ghussein, Fadi; Mastanduno, Michael A; Jiang, Shudong; Pogue, Brian W; Paulsen, Keith D

    2014-01-01

    A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10(5) and 10(6) for the PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10(3) through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

  11. Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging

    PubMed Central

    Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

    2013-01-01

    A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10(5) and 10(6) for the PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10(3) through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

  12. Communications Support Processor (CSP)

    NASA Astrophysics Data System (ADS)

    Konopik, M. G.; Mack, R. B.

    1983-02-01

    This report discusses the advanced capabilities developed for the Communications Support Processor and the impact these capabilities have on the system and its users. The technical performance of the CSP, the CSP system support functions, and the on-site maintenance support are detailed in this report. The improvements to the CSP system include improved transportability, site-unique gateways, on-line retrieval of traffic, plain-language addressing, and fully automated routing of messages.

  13. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution and a computation involving the regular section A(l:h:s), the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little run-time overhead and acceptable preprocessing time.
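    The access pattern the abstract describes can be sketched by brute force: enumerate the global indices of A(l:h:s), keep the ones a given processor owns under cyclic(k), and convert them to local addresses. The paper's contribution is computing the repeating gap table quickly rather than by enumeration; the block-major local layout below is an assumption for illustration.

    ```python
    def local_addresses(l, h, s, p, k, q):
        """Local memory addresses touched by processor q for the regular
        section A(l:h:s) under a cyclic(k) distribution over p processors.
        Assumes a block-major local layout (course * k + offset)."""
        addrs = []
        for g in range(l, h + 1, s):
            block, offset = divmod(g, k)
            if block % p == q:              # q owns this block of k elements
                course = block // p         # index of this block among q's blocks
                addrs.append(course * k + offset)
        return addrs

    seq = local_addresses(l=0, h=199, s=4, p=3, k=5, q=1)
    deltas = [b - a for a, b in zip(seq, seq[1:])]
    # The deltas cycle with period at most k -- that repeating cycle is the
    # finite state machine of at most k states described in the abstract.
    ```

    For these parameters the delta sequence settles into the 5-state cycle (2, 4, 2, 6, 6), so a processor can generate its entire local access stream from a table of at most k entries.
    
    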

  14. Generating local addresses and communication sets for data-parallel programs

    NASA Technical Reports Server (NTRS)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, and a computation involving the regular section A, the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.

  15. QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays.

    PubMed

    Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

    2014-01-01

    Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as "smart" Petri dishes, MEAs have demonstrated a great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate large amounts of raw data (e.g., 60 channels/MEA, 16 bits A/D conversion, 20 kHz sampling rate: approximately 8 GB/MEA/h uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users.
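    The batch-queuing idea maps naturally onto a worker pool, since each recorded channel is an independent job. The sketch below is not QSpike Tools itself: the per-channel pipeline (a crude high-pass filter plus threshold-crossing detection with a robust noise estimate) is a stand-in for the real filtering and spike detection, and a thread pool stands in for the multi-core/cluster queue.

    ```python
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    FS = 20_000.0  # 20 kHz sampling, as in the abstract

    def preprocess_channel(job):
        """One channel's CPU-intensive stage: remove a running baseline,
        estimate the noise level robustly, and detect threshold crossings."""
        channel_id, samples = job
        baseline = np.convolve(samples, np.ones(25) / 25, mode="same")
        filtered = samples - baseline
        sigma = np.median(np.abs(filtered)) / 0.6745   # robust noise estimate
        spike_times = np.flatnonzero(filtered < -5 * sigma) / FS
        return channel_id, spike_times

    def batch_preprocess(jobs, workers=4):
        # Every channel is an independent job, so the whole batch queues
        # onto a pool (a process pool or cluster in a real deployment).
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(pool.map(preprocess_channel, jobs))

    rng = np.random.default_rng(0)
    jobs = []
    for ch in range(8):                        # 8 of the 60 MEA channels
        trace = 0.1 * rng.standard_normal(4000)
        trace[1000] = -3.0                     # injected spike at t = 50 ms
        jobs.append((ch, trace))
    results = batch_preprocess(jobs)
    ```

    Each worker returns (channel, spike times) independently, which is what makes the preprocessing embarrassingly parallel across channels.
    
    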

  16. QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays

    PubMed Central

    Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

    2014-01-01

    Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as “smart” Petri dishes, MEAs have demonstrated a great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate large amounts of raw data (e.g., 60 channels/MEA, 16 bits A/D conversion, 20 kHz sampling rate: approximately 8 GB/MEA/h uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users. PMID:24678297

  17. MAP3D: a media processor approach for high-end 3D graphics

    NASA Astrophysics Data System (ADS)

    Darsa, Lucia; Stadnicki, Steven; Basoglu, Chris

    1999-12-01

    Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with high performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline--allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to successfully find and exploit any parallelism in a programmer's code, thus decreasing the time to market of a given application. The ability to run an operating system makes it possible to run concurrent applications in the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved--using real-time decoded imagery for texturing 3D objects, for instance. This novel architecture enables an affordable, integrated solution for high performance 3D graphics.

  18. Conversion via software of a simd processor into a mimd processor

    SciTech Connect

    Guzman, A.; Gerzso, M.; Norkin, K.B.; Vilenkin, S.Y.

    1983-01-01

    A method is described which takes a pure LISP program and automatically decomposes it via automatic parallelization into several parts, one for each processor of an SIMD architecture. Each of these parts is a different execution flow, i.e., a different program. The execution of these different programs by an SIMD architecture is examined. The method has been developed in some detail for the PS-2000, an SIMD Soviet multiprocessor, making it behave like AHR, a Mexican MIMD multi-microprocessor. Both the PS-2000 and AHR execute a pure LISP program in parallel; its decomposition into n pieces, their synchronization, scheduling, etc., are performed by the system (hardware and software). In order to achieve simultaneous execution of different programs in an SIMD processor, the method uses a scheme of node scheduling and node exportation. 14 references.

  19. The new UA1 calorimeter trigger processor

    SciTech Connect

    Baird, S.A.; Campbell, D.; Cawthraw, M.; Coughlan, J.; Flynn, P.; Galagadera, S.; Grayer, G.; Halsall, R.; Shah, T.P.; Stephens, R.

    1989-02-01

    The UA1 First Level Trigger Processor (TP) is a fast digital machine with a highly parallel pipelined architecture of fast TTL combinational and programmable logic controlled by programmable microsequencers. The TP uses 100,000 ICs housed in 18 crates, each containing 21 Fastbus-sized modules. It is hardwired with a very high level of interconnection. The energy deposited in the upgraded calorimeter is digitised into 1700 bytes of input data every beam crossing. The Processor selects events for further processing within 1.5 microseconds. The new electron trigger has improved hadron jet rejection, achieved by requiring low energy deposition around the electro-magnetic cluster. A missing transverse energy trigger and a total energy trigger have also been implemented.

  20. Highly parallel computer architecture for robotic computation

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)

    1991-01-01

    In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  1. Fault detection and bypass in a sequence information signal processor

    NASA Technical Reports Server (NTRS)

    Peterson, John C. (Inventor); Chow, Edward T. (Inventor)

    1992-01-01

    The invention comprises a plurality of scan registers, each such register respectively associated with a processor element; an on-chip comparator, encoder and fault bypass register. Each scan register generates a unitary signal the logic state of which depends on the correctness of the input from the previous processor in the systolic array. These unitary signals are input to a common comparator which generates an output indicating whether or not an error has occurred. These unitary signals are also input to an encoder which identifies the location of any fault detected so that an appropriate multiplexer can be switched to bypass the faulty processor element. Input scan data can be readily programmed to fully exercise all of the processor elements so that no fault can remain undetected.
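    The comparator/encoder/bypass logic of the abstract can be sketched in software. This is a behavioral illustration, not the patent's circuit: `scan_flags` stands in for the per-element scan-register outputs, and a single-fault model is assumed (matching the single bypass register described).

    ```python
    def detect_fault(scan_flags):
        """Comparator + encoder sketch: scan_flags[i] is True when processor
        element i reproduced the expected scan pattern from its upstream
        neighbour in the systolic array. Returns (error, element to bypass)."""
        faulty = [i for i, ok in enumerate(scan_flags) if not ok]
        error = bool(faulty)                    # comparator: any mismatch at all?
        bypass = faulty[0] if error else None   # encoder: locate the fault
        return error, bypass

    def systolic_route(stages, scan_flags):
        """Multiplexer sketch: route data around the element flagged faulty."""
        _, bad = detect_fault(scan_flags)
        return [s for i, s in enumerate(stages) if i != bad]
    ```

    Routing around the faulty element, rather than halting the array, is what lets the systolic pipeline keep operating after a single-element failure.
    
    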

  2. Optical backplane interconnect switch for data processors and computers

    NASA Technical Reports Server (NTRS)

    Hendricks, Herbert D.; Benz, Harry F.; Hammer, Jacob M.

    1989-01-01

    An optoelectronic integrated device design is reported which can be used to implement an all-optical backplane interconnect switch. The switch is sized to accommodate an array of processors and memories suitable for direct replacement into the basic avionic multiprocessor backplane. The optical backplane interconnect switch is also suitable for direct replacement of the PI bus traffic switch and at the same time, suitable for supporting pipelining of the processor and memory. The 32 bidirectional switchable interconnects are configured with broadcast capability for controls, reconfiguration, and messages. The approach described here can handle a serial interconnection of data processors or a line-to-link interconnection of data processors. An optical fiber demonstration of this approach is presented.

  3. Problem size, parallel architecture and optimal speedup

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Willard, Frank H.

    1987-01-01

    The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. The relationships between grid size, stencil type, partitioning strategy, processor execution time, and communication network type are analytically quantified. In so doing, the optimal number of processors to assign to the solution was determined, and the analysis identified (1) the smallest grid size which fully benefits from using all available processors, (2) the leverage on performance given by increasing processor speed or communication network speed, and (3) the suitability of various architectures for large numerical problems.
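    The trade-off the abstract quantifies can be reproduced with a toy cost model. The coefficients and terms below are invented for illustration (they are not the paper's model): computation shrinks like n^2/p, while boundary exchange and a log2(p) convergence reduction grow or stay fixed, so small grids stop benefiting from extra processors well before the machine is exhausted.

    ```python
    import math

    def exec_time(p, n, t_grid=1e-6, t_msg=1e-4, t_byte=1e-8):
        """Illustrative per-iteration cost for an n x n grid striped over p
        processors with a 5-point stencil: n^2/p computation, two boundary-row
        exchanges, and a log2(p) global convergence reduction."""
        compute = (n * n / p) * t_grid
        if p == 1:
            return compute                       # no communication on one processor
        exchange = 2 * (t_msg + n * t_byte)      # swap two boundary rows of n points
        reduce = t_msg * math.log2(p)            # global convergence check
        return compute + exchange + reduce

    def optimal_processors(n, p_max=128):
        """Processor count minimizing the modeled execution time."""
        return min(range(1, p_max + 1), key=lambda p: exec_time(p, n))
    ```

    With these numbers a 32 x 32 grid saturates at a handful of processors, while a 1024 x 1024 grid keeps gaining all the way to the full 128-processor machine, mirroring the paper's "smallest grid size which fully benefits" observation.
    
    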

  4. Problem size, parallel architecture, and optimal speedup

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Willard, Frank H.

    1988-01-01

    The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. The relationships between grid size, stencil type, partitioning strategy, processor execution time, and communication network type are analytically quantified. In so doing, the optimal number of processors to assign to the solution was determined, and the analysis identified (1) the smallest grid size which fully benefits from using all available processors, (2) the leverage on performance given by increasing processor speed or communication network speed, and (3) the suitability of various architectures for large numerical problems.

  5. Dielectrophoretic cell trapping and parallel one-to-one fusion based on field constriction created by a micro-orifice array

    PubMed Central

    Gel, Murat; Kimura, Yuji; Kurosawa, Osamu; Oana, Hidehiro; Kotera, Hidetoshi; Washizu, Masao

    2010-01-01

    Micro-orifice based cell fusion assures high-yield fusion without compromising the cell viability. This paper examines feasibility of a dielectrophoresis (DEP) assisted cell trapping method for parallel fusion with a micro-orifice array. The goal is to create viable fusants for studying postfusion cell behavior. We fabricated a microfluidic chip that contained a chamber and partition. The partition divided the chamber into two compartments and it had a number of embedded micro-orifices. The voltage applied to the electrodes located at each compartment generated an electric field distribution concentrating in micro-orifices. Cells introduced into each compartment moved toward the micro-orifice array by manipulation of hydrostatic pressure. DEP assisted trapping was used to keep the cells in micro-orifice and to establish cell to cell contact through orifice. By applying a pulse, cell fusion was initiated to form a neck between cells. The neck passing through the orifice resulted in immobilization of the fused cell pair at micro-orifice. After washing away the unfused cells, the chip was loaded to a microscope with stage top incubator for time lapse imaging of the selected fusants. The viable fusants were successfully generated by fusion of mouse fibroblast cells (L929). Time lapse observation of the fusants showed that fused cell pairs escaping from micro-orifice became one tetraploid cell. The generated tetraploid cells divided into three daughter cells. The fusants generated with a smaller micro-orifice (diameter∼2 μm) were kept immobilized at micro-orifice until cell division phase. After observation of two synchronized cell divisions, the fusant divided into four daughter cells. We conclude that the presented method of cell pairing and fusion is suitable for high-yield generation of viable fusants and furthermore, subsequent study of postfusion phenomena. PMID:20697592

  6. Multi-microprocessor that executes pure LISP in parallel

    SciTech Connect

    Guzman, A.

    1982-01-01

    The architecture presented allows parallel computation of high level languages, with some advantages: (1) the programmer is unaware that he is writing programs for a parallel computer; (2) the processors communicate little with each other, so that interconnection problems are minimised; (3) a given processor is unaware of how many other processors there are, or what they are doing; (4) a processor never waits for another process to have finished, nor does it awake or interrupt another processor. The machine processes in parallel programs written in high level languages capable of being expressed in the lambda notation (applicative languages). It is formed by a collection of general purpose processors which are weakly coupled and without hierarchy. Asynchronous computation is permitted by each processor evaluating a part of a program. 17 references.

  7. Distributed processor allocation for launching applications in a massively connected processors complex

    DOEpatents

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
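    The distributed-allocator idea can be sketched as several allocator instances sharing one database of processor ownership. This toy is an assumption-laden illustration of the patent's scheme, not its implementation: the "database" is a plain dict, and the locking/consensus a real multi-node allocator needs is elided.

    ```python
    class ComputeProcessorAllocator:
        """Toy allocator: many instances share one ownership 'database'
        mapping processor id -> owning job (None when free)."""

        def __init__(self, shared_db):
            self.db = shared_db          # shared among all allocator instances

        def allocate(self, job, count):
            """Claim `count` free processors for `job`, or none at all."""
            free = [p for p, owner in self.db.items() if owner is None]
            if len(free) < count:
                return []
            for p in free[:count]:
                self.db[p] = job
            return free[:count]

        def release(self, job):
            """Return all of `job`'s processors to the free pool."""
            for p, owner in list(self.db.items()):
                if owner == job:
                    self.db[p] = None

    db = {i: None for i in range(8)}     # an 8-processor complex
    alloc_a = ComputeProcessorAllocator(db)
    alloc_b = ComputeProcessorAllocator(db)
    ```

    Because both allocators consult the same database, a launch handled by either one sees a consistent picture of which processors are free.
    
    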

  8. Algorithmic commonalities in the parallel environment

    NASA Technical Reports Server (NTRS)

    Mcanulty, Michael A.; Wainer, Michael S.

    1987-01-01

    The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.

  9. Generic-type hierarchical multi digital signal processor system for hard-field tomography.

    PubMed

    Garcia Castillo, Sergio; Ozanyan, Krikor B

    2007-05-01

    This article introduces the design and implementation of a hierarchical multi digital signal processor system aimed to perform parallel multichannel measurements and data processing of the type widely used in hard-field tomography. Details are presented of a complete tomography system with modular and expandable architecture, capable of accommodating a variety of data processing modalities, configured by software. The configuration of the acquisition and processing circuits and the management of the data flow allow a data frame rate of up to 250 kHz. Results of a case study, guided path tomography for temperature mapping, are shown as a direct demonstration of the system's capabilities. Digital lock-in detection is employed for data processing to extract the information from ac measurements of the temperature-induced resistance changes in an array of 32 noninteracting transducers, which is further exported for visualization.
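    Digital lock-in detection, which the abstract uses to extract resistance changes from AC measurements, has a compact core: multiply the signal by quadrature references at the excitation frequency and low-pass the products. The sketch below is generic (not the article's DSP firmware) and uses a plain mean over an integer number of cycles as the low-pass stage.

    ```python
    import numpy as np

    def lock_in(signal, fs, f_ref):
        """Digital lock-in: mix with quadrature references, then low-pass.
        Averaging over an integer number of reference cycles rejects the
        2*f_ref mixing products exactly."""
        t = np.arange(len(signal)) / fs
        i = 2 * np.mean(signal * np.cos(2 * np.pi * f_ref * t))   # in-phase
        q = 2 * np.mean(signal * np.sin(2 * np.pi * f_ref * t))   # quadrature
        return np.hypot(i, q), np.arctan2(-q, i)                  # amplitude, phase

    # A 1 kHz response of amplitude 0.02 buried in noise five times larger:
    fs, f_ref, n = 50_000.0, 1_000.0, 50_000
    t = np.arange(n) / fs
    rng = np.random.default_rng(1)
    trace = 0.02 * np.cos(2 * np.pi * f_ref * t + 0.3) + 0.1 * rng.standard_normal(n)
    amp, phase = lock_in(trace, fs, f_ref)
    ```

    Even with the carrier well below the noise floor, the recovered amplitude and phase land close to the injected 0.02 and 0.3 rad, which is exactly why lock-in detection suits small temperature-induced resistance changes.
    
    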

  10. Reconfigurable data path processor

    NASA Technical Reports Server (NTRS)

    Donohoe, Gregory (Inventor)

    2005-01-01

    A reconfigurable data path processor comprises a plurality of independent processing elements, each advantageously having an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output, and each is also capable of passing an input through as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input, and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic status bits is generated according to the processing of the first processable value, and a second set of arithmetic status bits may be generated from a second processing operation. An arithmetic-status-bit multiplexer selects the desired set of arithmetic status bits from among the first and second sets. The conditional multiplexer evaluates the selected arithmetic status bits according to a logical mask defining an algorithm for evaluating them.
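    The output-control mechanism can be illustrated behaviorally: AND the arithmetic status bits with a logical mask and let any surviving bit steer the multiplexer. The "any masked bit selects the processed value" rule below is an assumption for illustration; the patent abstract does not pin down the evaluation algorithm.

    ```python
    def conditional_mux(pass_through, processed, status_bits, mask):
        """Output-control sketch for one processing element: the arithmetic
        status bits are gated through a logical mask, and any surviving bit
        selects the processed value over the pass-through value."""
        select = any(s and m for s, m in zip(status_bits, mask))
        return processed if select else pass_through

    # status_bits might be (zero, negative, carry, overflow) from the last
    # ALU operation; masking only the 'negative' bit then implements
    # "emit the processed value when the result was negative".
    ```

    Changing only the mask reprograms the element's conditional behavior without touching the data path, which is the point of generating the output control command from status bits rather than hardwiring it.
    
    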

  11. Scaling and Graphical Transport-Map Analysis of Ambipolar Schottky-Barrier Thin-Film Transistors Based on a Parallel Array of Si Nanowires.

    PubMed

    Jeon, Dae-Young; Pregl, Sebastian; Park, So Jeong; Baraban, Larysa; Cuniberti, Gianaurelio; Mikolajick, Thomas; Weber, Walter M

    2015-07-01

    Si nanowire (Si-NW) based thin-film transistors (TFTs) have been considered as a promising candidate for next-generation flexible and wearable electronics as well as sensor applications with high performance. Here, we have fabricated ambipolar Schottky-barrier (SB) TFTs consisting of a parallel array of Si-NWs and performed an in-depth study of their electrical performance and operation mechanism through several electrical parameters extracted with a channel-length-scaling-based method. In particular, the newly suggested current-voltage (I-V) contour map clearly elucidates the unique operation mechanism of the ambipolar SB-TFTs, governed by the Schottky junction between NiSi2 and Si-NW. Further, it reveals, for the first time in SB-based FETs, the important internal electrostatic coupling between the channel and externally applied voltages. This work provides helpful information for the realization of practical circuits with ambipolar SB-TFTs that can be transferred to different substrate technologies and applications.

  12. Signal-to-noise ratio and parallel imaging performance of commercially available phased array coils in 3.0 T brain magnetic resonance imaging.

    PubMed

    Yoshida, Tsukasa; Shirata, Kensei; Urikura, Atsushi; Ito, Michitoshi; Nakaya, Yoshihiro

    2015-07-01

    The signal-to-noise ratio (SNR) and parallel imaging (PI) performance of two commercial phased-array coils (PACs) were examined in magnetic resonance imaging (MRI) of the brain. All measurements were performed on a 3.0 T MRI instrument. The SNR and PI performance were evaluated with 32-channel and 15-channel PACs. A gradient echo sequence was used for obtaining images of a phantom. SNR and geometry factor (g-factor) maps were calculated from two images with identical parameters. Horizontal and vertical profiles were taken through the SNR maps in the axial plane. The average g-factor was measured in a circular region of interest in the g-factor maps for the axial plane. The SNR map of the 32-channel coil showed a higher SNR than that of the 15-channel coil at the phantom's posterior and lateral surfaces. The SNR profiles for the 32-channel coil also showed a 1.3-fold increase at the phantom's center. The average g-factor of the 32-channel coil was lower than that of the 15-channel coil at the same acceleration factor. These results indicate that the 32-channel coil can provide a higher spatial resolution and/or a faster imaging speed. Horizontal and vertical profiles are useful for evaluation of the performance of commercially available PACs.
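    The two quantities the study measures have standard recipes, which can be sketched briefly. Computing an SNR map "from two images with identical parameters" is commonly done with the difference method, and the g-factor relates full and accelerated SNR; the formulas below are those standard definitions, not code from the article.

    ```python
    import numpy as np

    def snr_map(img1, img2):
        """SNR map from two identically acquired images (difference method):
        signal is the pixelwise mean; noise is the standard deviation of the
        difference image divided by sqrt(2)."""
        noise = np.std(img1 - img2) / np.sqrt(2)
        return 0.5 * (img1 + img2) / noise

    def g_factor(snr_full, snr_accel, R):
        """Geometry factor for acceleration R: g = SNR_full / (sqrt(R) * SNR_PI)."""
        return snr_full / (np.sqrt(R) * snr_accel)

    # Synthetic phantom: uniform signal 100 with noise sigma 5 -> SNR about 20.
    rng = np.random.default_rng(3)
    phantom = np.full((128, 128), 100.0)
    scan1 = phantom + 5.0 * rng.standard_normal(phantom.shape)
    scan2 = phantom + 5.0 * rng.standard_normal(phantom.shape)
    snr = snr_map(scan1, scan2)
    ```

    A lower g-factor at the same acceleration, as reported for the 32-channel coil, means less SNR penalty for the same speed-up.
    
    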

  13. Simulation of an array-based neural net model

    NASA Technical Reports Server (NTRS)

    Barnden, John A.

    1987-01-01

    Research in cognitive science suggests that much of cognition involves the rapid manipulation of complex data structures. However, it is very unclear how this could be realized in neural networks or connectionist systems. A core question is: how could the interconnectivity of items in an abstract-level data structure be neurally encoded? The answer appeals mainly to positional relationships between activity patterns within neural arrays, rather than directly to neural connections in the traditional way. The new method was initially devised to account for abstract symbolic data structures, but it also supports cognitively useful spatial analogue, image-like representations. As the neural model is based on massive, uniform, parallel computations over 2D arrays, the massively parallel processor is a convenient tool for simulation work, although there are complications in using the machine to the fullest advantage. An MPP Pascal simulation program for a small pilot version of the model is running.

  14. Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.

    2000-01-01

    Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining
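    The mapping idea at Charon's core, tying a legacy global array to per-processor blocks so the two representations can coexist during conversion, can be sketched in a few lines. The class and names below are hypothetical illustrations of that idea, not Charon's actual C/Fortran API.

    ```python
    import numpy as np

    class BlockMap:
        """Minimal sketch of a distribution mapping: a legacy global array
        is linked to p contiguous blocks, so converted code can work on the
        'distributed' side while unconverted code still sees the global view."""

        def __init__(self, n, p):
            bounds = np.linspace(0, n, p + 1).astype(int)
            self.slices = [slice(bounds[i], bounds[i + 1]) for i in range(p)]

        def scatter(self, global_arr):
            """Global -> distributed: one block per (simulated) processor."""
            return [global_arr[s].copy() for s in self.slices]

        def gather(self, blocks):
            """Distributed -> global: reassemble the legacy array."""
            return np.concatenate(blocks)

    m = BlockMap(10, 3)
    a = np.arange(10.0)          # legacy global array
    blocks = m.scatter(a)
    blocks[1][:] += 100          # a converted loop updates distributed data
    a2 = m.gather(blocks)        # unconverted code resumes on the global view
    ```

    Because scatter and gather can be called at any point in the program, the conversion proceeds one loop at a time while the rest of the legacy code keeps running on the global array, which is the incremental-parallelization property the abstract emphasizes.
    
    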

  15. LEWICE droplet trajectory calculations on a parallel computer

    NASA Technical Reports Server (NTRS)

    Caruso, Steven C.

    1993-01-01

    A parallel computer implementation (128 processors) of LEWICE, a NASA Lewis code used to predict the time-dependent ice accretion process for two-dimensional aerodynamic bodies of simple geometries, is described. Two-dimensional parallel droplet trajectory calculations are performed to demonstrate the potential benefits of applying parallel processing to ice accretion analysis. Parallel performance is evaluated as a function of the number of trajectories and the number of processors. For comparison, similar trajectory calculations are performed on single-processor Cray computers, and the best parallel results are found to be 33 and 23 times faster, respectively, than those of the Cray X-MP and Y-MP.

  16. Parallel VLSI architecture emulation and the organization of APSA/MPP

    NASA Technical Reports Server (NTRS)

    Odonnell, John T.

    1987-01-01

    The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a VAX. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

  17. Optimal scheduling and balancing of multiprogrammed loads over heterogeneous processors

    SciTech Connect

    Haddad, E.

    1995-12-01

    The serial/parallel allocation of multiprogrammed load modules among heterogeneous virtual memory processors is formulated as an instance of the separable resource allocation problem with nonconvex functions. Mild conditions on the functions lead to an efficient solution in constant time.

  18. Optimization of Particle-in-Cell Codes on RISC Processors

    NASA Technical Reports Server (NTRS)

    Decyk, Viktor K.; Karmesin, Steve Roy; Boer, Aeint de; Liewer, Paulette C.

    1996-01-01

    General strategies are developed to optimize particle-in-cell codes written in Fortran for the RISC processors commonly used on massively parallel computers. These strategies include data reorganization to improve cache utilization and code reorganization to improve the efficiency of arithmetic pipelines.
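    One common form of the data reorganization mentioned above is converting an array-of-structures particle layout into structure-of-arrays form, so the inner push loop walks memory with unit stride. A minimal sketch with illustrative names, not the authors' code:

    ```python
    # Structure-of-arrays reorganization: particle fields stored in
    # separate contiguous arrays so the update loop has unit stride.

    def aos_to_soa(particles):
        """particles: list of (x, v) tuples -> two parallel lists."""
        xs = [p[0] for p in particles]
        vs = [p[1] for p in particles]
        return xs, vs

    def push(xs, vs, dt):
        """Unit-stride position update over the coordinate array."""
        for i in range(len(xs)):
            xs[i] += vs[i] * dt
        return xs
    ```

    In Fortran or C the same layout lets consecutive loop iterations reuse each cache line.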

  19. Software-Reconfigurable Processors for Spacecraft

    NASA Technical Reports Server (NTRS)

    Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

    2005-01-01

    A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, and error-correction coding and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

  20. Efficacy of Code Optimization on Cache-based Processors

    NASA Technical Reports Server (NTRS)

    VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)

    1997-01-01

    The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system.
It can be argued that although some of the important
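    The 'fat loop' advice above can be illustrated by fusing two array passes into one, so each datum brought into cache is used for more arithmetic before eviction. A minimal sketch of the transformation:

    ```python
    # Loop fusion: two traversals of the data collapsed into one
    # 'fat' loop that performs both operations per element.

    def two_passes(a, b):
        c = [ai * 2.0 for ai in a]                 # pass 1: reads a, writes c
        return [ci + bi for ci, bi in zip(c, b)]   # pass 2: reads c and b

    def fused(a, b):
        # single fat loop: same result, one traversal of the data
        return [ai * 2.0 + bi for ai, bi in zip(a, b)]
    ```

    In a compiled language the fused form halves the number of times each cache line must be fetched.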

  1. Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

    SciTech Connect

    Liao, C; Quinlan, D J; Willcock, J J; Panas, T

    2008-12-12

    Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

  2. Finite element computation with parallel VLSI

    NASA Technical Reports Server (NTRS)

    Mcgregor, J.; Salama, M.

    1983-01-01

    This paper describes a parallel processing computer consisting of a 16-bit microcomputer as a master processor which controls and coordinates the activities of 8086/8087 VLSI chip set slave processors working in parallel. The hardware is inexpensive and can be flexibly configured and programmed to perform various functions. This makes it a useful research tool for the development of, and experimentation with, parallel mathematical algorithms. Application of the hardware to computational tasks involved in the finite element analysis method is demonstrated by the generation and assembly of beam finite element stiffness matrices. A number of possible schemes for the implementation of N elements on N or n processors (N is greater than n) are described, and the speedup factors of their time consumption are determined as a function of the number of available parallel processors.

  3. 3D-Flow processor for a programmable Level-1 trigger (feasibility study)

    SciTech Connect

    Crosetto, D.

    1992-10-01

    A feasibility study has been made to use the 3D-Flow processor in a pipelined programmable parallel processing architecture to identify particles such as electrons, jets, muons, etc., in high-energy physics experiments.

  4. Massively parallel mathematical sieves

    SciTech Connect

    Montry, G.R.

    1989-01-01

    The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
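    A blocked decomposition of the sieve, in which each processor crosses out multiples of the base primes within its own subrange, can be simulated serially as follows. This is an illustrative sketch, not the authors' scattered-decomposition code.

    ```python
    import math

    def sieve_block(lo, hi, n):
        """Primes in [lo, hi), one 'processor's' share of sieving 2..n-1."""
        limit = math.isqrt(n)
        # every processor first computes the small base primes up to sqrt(n)
        base = [p for p in range(2, limit + 1)
                if all(p % q for q in range(2, math.isqrt(p) + 1))]
        flags = [True] * (hi - lo)
        for p in base:
            # first multiple of p in [lo, hi) that is not p itself
            start = max(p * p, ((lo + p - 1) // p) * p)
            for m in range(start, hi, p):
                flags[m - lo] = False
        return [i + lo for i, ok in enumerate(flags) if ok and i + lo >= 2]
    ```

    Because the subranges are independent once the base primes are known, the blocks can be sieved concurrently with no communication.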

  5. Micromechanical resonator array for an implantable bionic ear.

    PubMed

    Bachman, Mark; Zeng, Fan-Gang; Xu, Tao; Li, G-P

    2006-01-01

    In this paper we report on a multi-resonant transducer that may be used to replace a traditional speech processor in cochlear implant applications. The transducer, made from an array of micro-machined polymer resonators, is capable of passively splitting sound into its frequency sub-bands without the need for analog-to-digital conversion and subsequent digital processing. Since all bands are mechanically filtered in parallel, there is low latency in the output signals. The simplicity of the device, high channel capability, low power requirements, and small form factor (less than 1 cm) make it a good candidate for a completely implantable bionic ear device.

  6. Interactive digital signal processor

    NASA Technical Reports Server (NTRS)

    Mish, W. H.; Wenger, R. M.; Behannon, K. W.; Byrnes, J. B.

    1982-01-01

    The Interactive Digital Signal Processor (IDSP) is examined. It consists of a set of time series analysis Operators each of which operates on an input file to produce an output file. The operators can be executed in any order that makes sense and recursively, if desired. The operators are the various algorithms used in digital time series analysis work. User written operators can be easily interfaced to the system. The system can be operated both interactively and in batch mode. In IDSP a file can consist of up to n (currently n=8) simultaneous time series. IDSP currently includes over thirty standard operators that range from Fourier transform operations, design and application of digital filters, eigenvalue analysis, to operators that provide graphical output, allow batch operation, editing and display information.

  7. CoNNeCT Baseband Processor Module

    NASA Technical Reports Server (NTRS)

    Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.

    2011-01-01

    A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25- Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.

  8. Parallel Monte Carlo simulation of multilattice thin film growth

    NASA Astrophysics Data System (ADS)

    Shu, J. W.; Lu, Qin; Wong, Wai-on; Huang, Han-chen

    2001-07-01

    This paper describes a new parallel algorithm for the multi-lattice Monte Carlo atomistic simulator for thin film deposition (ADEPT), implemented on a parallel computer using the PVM (Parallel Virtual Machine) message passing library. The parallel algorithm is based on domain decomposition with overlapping and asynchronous communication. Multiple lattices are represented by a single reference lattice through one-to-one mappings, with resulting computational demands comparable to those in the single-lattice Monte Carlo model. Asynchronous communication and domain overlapping techniques are used to reduce the waiting time and communication time among parallel processors. Results show that the algorithm is highly efficient with a large number of processors. The algorithm was implemented on a parallel machine with 50 processors, and it is suitable for parallel Monte Carlo simulation of thin film growth with either a distributed memory parallel computer or a shared memory machine with message passing libraries. The significant communication time in parallel MC simulation of thin film growth is effectively reduced by adopting domain decomposition with overlapping between sub-domains and asynchronous communication among processors. The communication overhead does not increase appreciably, and speedup shows an ascending tendency as the number of processors increases. A near-linear increase in computing speed was achieved as the number of processors increased, and there is no theoretical limit on the number of processors that can be used. The techniques developed in this work are also suitable for the implementation of the Monte Carlo code on other parallel systems.

  9. Enhancing Scalability of Parallel Structured AMR Calculations

    SciTech Connect

    Wissink, A M; Hysom, D; Hornung, R D

    2003-02-10

    This paper discusses parallel scaling performance of large scale parallel structured adaptive mesh refinement (SAMR) calculations in SAMRAI. Previous work revealed that poor scaling qualities in the adaptive gridding operations in SAMR calculations cause them to become dominant for cases run on up to 512 processors. This work describes algorithms we have developed to enhance the efficiency of the adaptive gridding operations. Performance of the algorithms is evaluated for two adaptive benchmarks run on up to 512 processors of an IBM SP system.

  10. Parallel design patterns for a low-power, software-defined compressed video encoder

    NASA Astrophysics Data System (ADS)

    Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar

    2011-06-01

    Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allow the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memory-network processor device.

  11. Parallel computation and computers for artificial intelligence

    SciTech Connect

    Kowalik, J.S.

    1988-01-01

    This book discusses Parallel Processing in Artificial Intelligence; Parallel Computing using Multilisp; Execution of Common Lisp in a Parallel Environment; Qlisp; Restricted AND-Parallel Execution of Logic Programs; PARLOG: Parallel Programming in Logic; and Data-driven Processing of Semantic Nets. Attention is also given to: Application of the Butterfly Parallel Processor in Artificial Intelligence; On the Range of Applicability of an Artificial Intelligence Machine; Low-level Vision on Warp and the Apply Programming Model; AHR: A Parallel Computer for Pure Lisp; FAIM-1: An Architecture for Symbolic Multi-processing; and Overview of AI Application Oriented Parallel Processing Research in Japan.

  12. FY 2006 Accomplishment Colony - "Services and Interfaces to Support Large Numbers of Processors"

    SciTech Connect

    Jones, T; Kale, L; Moreira, J; Mendes, C; Chakravorty, S; Tauferner, A; Inglett, T

    2006-06-30

    The Colony Project is developing operating system and runtime system technology to enable efficient general purpose environments on tens of thousands of processors. To accomplish this, we are investigating memory management techniques, fault management strategies, and parallel resource management schemes. Recent results show promising findings for scalable strategies based on processor virtualization, in-memory checkpointing, and parallel aware modifications to full featured operating systems.

  13. Parallel Genetic Algorithm for Alpha Spectra Fitting

    NASA Astrophysics Data System (ADS)

    García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio

    2005-01-01

    We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. GA is a highly resource-demanding method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors to use in a simulation.
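    The two-step structure (a cheap global search feeding a local refiner) can be sketched with a toy random search standing in for the parallel GA and a shrinking-step local search standing in for Levenberg-Marquardt. All names and the 1-D objective are illustrative assumptions.

    ```python
    import random

    def two_step_fit(f, lo, hi, coarse=200, seed=0):
        """Minimize f on [lo, hi]: global scatter search, then refinement."""
        rng = random.Random(seed)
        # step 1: global random search (stand-in for the parallel GA)
        best = min((rng.uniform(lo, hi) for _ in range(coarse)), key=f)
        # step 2: local refinement (stand-in for Levenberg-Marquardt)
        step = (hi - lo) / coarse
        while step > 1e-9:
            moved = False
            for cand in (best - step, best + step):
                if f(cand) < f(best):
                    best, moved = cand, True
            if not moved:
                step /= 2  # no neighbor improves: tighten the search
        return best
    ```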

  14. Parallel machine architecture for production rule systems

    DOEpatents

    Allen, Jr., John D.; Butler, Philip L.

    1989-01-01

    A parallel processing system for production rule programs utilizes a host processor for storing production rule right-hand sides (RHS) and a plurality of rule processors for storing left-hand sides (LHS). The rule processors operate in parallel in the Recognize phase of the system's Recognize-Act cycle to match their respective LHSs against a stored list of working memory elements (WMEs) in order to find a self-consistent set of WMEs. The list of WMEs is dynamically varied during the Act phase of the system, in which the host executes, or fires, rule RHSs for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets, at which time the production rule system is halted.

  15. Parallelization of the CI Program PEDICI

    NASA Astrophysics Data System (ADS)

    Thorsteinsson, Thorstein; Rettrup, Sten

    The general CI code PEDICI has been parallelized by decomposing the occurring summation over two-electron integrals. The parallelization was formulated in terms of a "master/slave" model, and realized through use of the "PVM" message passing facility. We have aimed at achieving a reasonably simple implementation for use on machines with intermediate numbers of processors. Exploratory test runs on an IBM SP supercomputer (consisting of RS/6000 model P2SC (120 MHz) nodes) show a very satisfactory performance increase with the number of processors used, as well as encouraging balancing of the workload. Our largest 32-processor test case gives a speed-up factor of 30.27.

  16. Temporal fringe pattern analysis with parallel computing

    SciTech Connect

    Tuck Wah Ng; Kar Tien Ang; Argentini, Gianluca

    2005-11-20

    Temporal fringe pattern analysis is invaluable in transient phenomena studies but necessitates long processing times. Here we describe a parallel computing strategy based on the single-program multiple-data model and hyperthreading processor technology to reduce the execution time. In a two-node cluster workstation configuration we found that execution periods were reduced by 1.6 times when four virtual processors were used. To allow even lower execution times with an increasing number of processors, the time allocated for data transfer, data read, and waiting should be minimized. Parallel computing is found here to present a feasible approach to reduce execution times in temporal fringe pattern analysis.

  17. Matrix preconditioning: a robust operation for optical linear algebra processors.

    PubMed

    Ghosh, A; Paparao, P

    1987-07-15

    Analog electrooptical processors are best suited for applications demanding high computational throughput with tolerance for inaccuracies. Matrix preconditioning is one such application. Matrix preconditioning is a preprocessing step for reducing the condition number of a matrix and is used extensively with gradient algorithms for increasing the rate of convergence and improving the accuracy of the solution. In this paper, we describe a simple parallel algorithm for matrix preconditioning, which can be implemented efficiently on a pipelined optical linear algebra processor. From the results of our numerical experiments we show that the efficacy of the preconditioning algorithm is affected very little by the errors of the optical system.
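    Diagonal (Jacobi) scaling is one simple instance of preconditioning; for a symmetric positive definite 2x2 matrix the condition-number reduction can be checked directly. This is a sketch of the general idea, not the optical algorithm of the paper.

    ```python
    import math

    def cond2(a, b, c):
        """Condition number of the SPD 2x2 matrix [[a, b], [b, c]]
        (ratio of its largest to smallest eigenvalue)."""
        s = math.sqrt((a - c) ** 2 + 4.0 * b * b)
        lo, hi = (a + c - s) / 2.0, (a + c + s) / 2.0
        return hi / lo

    def jacobi_precondition(a, b, c):
        """Two-sided diagonal scaling D^-1/2 A D^-1/2 of [[a, b], [b, c]],
        which puts ones on the diagonal."""
        return 1.0, b / math.sqrt(a * c), 1.0
    ```

    For the badly scaled matrix [[100, 1], [1, 1]] the condition number drops from about 101 to about 1.2 after scaling, which is why gradient methods converge faster on the preconditioned system.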

  18. Scalable parallel communications

    NASA Technical Reports Server (NTRS)

    Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

    1992-01-01

    Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth
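    The coarse-grain idea of striping one application's traffic across several physical channels can be sketched as round-robin distribution and in-order reassembly. This is an illustrative sketch of space division multiplexing, not the simulated protocol stack.

    ```python
    # Round-robin striping of a packet stream over n parallel channels,
    # plus in-order reassembly at the receiver.

    def stripe(packets, n):
        """Distribute packets round-robin over n channels."""
        return [packets[i::n] for i in range(n)]

    def reassemble(channels):
        """Interleave the per-channel streams back into original order."""
        out = []
        for i in range(max(len(c) for c in channels)):
            for c in channels:
                if i < len(c):
                    out.append(c[i])
        return out
    ```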

  19. Efficient Breadth-First Search on the Cell/BE Processor

    SciTech Connect

    Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

    2008-10-01

    Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a breadth-first search (BFS) for advanced multi-core processors. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with a low-level implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model we have determined an accurate performance model that has guided the implementation and the optimization of our algorithms. To validate our approach, we use a state-of-the-art multicore processor, the Cell Broadband Engine (Cell BE). Our experiments, obtained on a pre-production Cell BE board running at 3.2 GHz, show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. The Cell BE is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, an order of magnitude faster than the MTA-2 multi-threaded processor, and two orders of magnitude faster than a BlueGene/L processor. Index Terms: Multi-core processors, Parallel Computing, Cell Broadband Engine, Parallelization Techniques, Graph Exploration Algorithms, Breadth-First Search, BFS.
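    The bulk-synchronous structure underlying such designs is that of level-synchronous BFS, where each round expands one frontier; the rounds map naturally onto synchronized parallel steps. A serial sketch:

    ```python
    # Level-synchronous BFS: each while-iteration is one 'superstep'
    # that expands the current frontier into the next one.

    def bfs_levels(adj, src):
        """adj: dict node -> list of neighbors. Returns node -> BFS level."""
        level = {src: 0}
        frontier = [src]
        depth = 0
        while frontier:
            depth += 1
            nxt = []
            for u in frontier:
                for v in adj.get(u, []):
                    if v not in level:
                        level[v] = depth
                        nxt.append(v)
            frontier = nxt
        return level
    ```

    In a parallel implementation the frontier is partitioned among processing elements and the rounds are separated by global synchronization, which is the pattern the BSP model captures.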

  20. Highly scalable linear solvers on thousands of processors.

    SciTech Connect

    Domino, Stefan Paul; Karlin, Ian; Siefert, Christopher; Hu, Jonathan Joseph; Robinson, Allen Conrad; Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  1. A Systolic Array Architecture For Processing Sonar Narrowband Signals

    NASA Astrophysics Data System (ADS)

    Mintzer, L.

    1988-07-01

    Modern sonars rely more upon visual than aural contacts. Lofargrams presenting a time history of hydrophone spectral content are standard means of observing narrowband signals. However, the frequency signal "tracks" are often embedded in noise, sometimes rendering their detection difficult and time consuming. Image enhancement algorithms applied to the 'grams can yield improvements in target data presented to the observer. A systolic array based on the NCR Geometric Arithmetic Parallel Processor (GAPP), a CMOS chip that contains 72 single-bit processors controlled in parallel, has been designed for evaluating image enhancement algorithms. With the processing nodes of the GAPP bearing a one-to-one correspondence with the pixels displayed on the 'gram, a very efficient SIMD architecture is realized. The low data rate of sonar displays, i.e., one line of 1000-4000 pixels per second, and the 10-MHz control clock of the GAPP provide the possibility of 10^7 operations per pixel in real time applications. However, this architecture cannot handle data-dependent operations efficiently. To this end a companion processor capable of efficiently executing branch operations has been designed. A simple spoke filter is simulated and applied to laboratory data with noticeable improvements in the resulting lofargram display.

  2. Implementing clips on a parallel computer

    NASA Technical Reports Server (NTRS)

    Riley, Gary

    1987-01-01

    The C Language Integrated Production System (CLIPS) is a forward-chaining rule-based language that provides training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.
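    The macroscopic partitioning described above (whole rules dealt out to processors, each matching its own subset against the shared fact list) can be sketched as follows. The helpers are illustrative, not the CLIPS implementation.

    ```python
    # Rule-set partitioning: each 'processor' matches only its own
    # subset of rules against the shared working memory (facts).

    def partition(rules, nprocs):
        """Deal whole rules out to processors round-robin."""
        return [rules[i::nprocs] for i in range(nprocs)]

    def match_all(facts, rule_subsets):
        """Each 'processor' reports which of its rules are satisfied;
        a rule is a (name, condition) pair."""
        agenda = []
        for subset in rule_subsets:
            agenda += [name for name, cond in subset if cond(facts)]
        return sorted(agenda)
    ```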

  3. Randomized parallel speedups for list ranking

    SciTech Connect

    Vishkin, U.

    1987-06-01

    The following problem is considered: given a linked list of length n, compute the distance of each element of the linked list from the end of the list. The problem has two standard deterministic algorithms: a linear-time serial algorithm, and an O((n log n)/rho + log n) time parallel algorithm using rho processors. The authors present a randomized parallel algorithm for the problem. The algorithm is designed for an exclusive-read exclusive-write parallel random access machine (EREW PRAM). It runs almost surely in time O(n/rho + log n log* n) using rho processors. Using a recently published parallel prefix sums algorithm, the list-ranking algorithm can be adapted to run on a concurrent-read concurrent-write parallel random access machine (CRCW PRAM) almost surely in time O(n/rho + log n) using rho processors.
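    The classical pointer-jumping technique that underlies parallel list ranking can be simulated serially; each of the O(log n) rounds below corresponds to one parallel step on the PRAM. A sketch of the baseline technique, not the paper's randomized algorithm:

    ```python
    # Pointer jumping for list ranking: after round k every element
    # knows its distance to a node 2^k hops ahead (or to the list end).

    def list_rank(succ):
        """succ[i] = next element of i, or i itself at the list's end.
        Returns the distance of each element from the end."""
        n = len(succ)
        rank = [0 if succ[i] == i else 1 for i in range(n)]
        succ = succ[:]
        for _ in range(n.bit_length()):
            # one 'parallel' round: every i jumps over its successor
            rank = [rank[i] + rank[succ[i]] for i in range(n)]
            succ = [succ[succ[i]] for i in range(n)]
        return rank
    ```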

  4. Parallel computation using limited resources

    SciTech Connect

    Sugla, B.

    1985-01-01

    This thesis addresses itself to the task of designing and analyzing parallel algorithms when the resources of processors, communication, and time are limited. The two parts of this thesis deal with multiprocessor systems and VLSI - the two important parallel processing environments that are prevalent today. In the first part a time-processor-communication tradeoff analysis is conducted for two kinds of problems - N input, 1 output, and N input, N output computations. In the class of problems of the second kind, the problem of prefix computation, an important problem due to the number of naturally occurring computations it can model, is studied. Finally, a general methodology is given for design of parallel algorithms that can be used to optimize a given design to a wide set of architectural variations. The second part of the thesis considers the design of parallel algorithms for the VLSI model of computation when the resource of time is severely restricted.

  5. Parallel algorithms for message decomposition

    SciTech Connect

    Teng, S.H.; Wang, B.

    1987-06-01

The authors consider the deterministic and random parallel complexity (time and processor) of message decoding: an essential problem in communication systems and translation systems. They present an optimal parallel algorithm to decompose prefix-coded messages and uniquely decipherable-coded messages in O(n/P) time, using O(P) processors (for all P: 1 ≤ P ≤ n/log n), deterministically as well as randomly, on the weakest version of parallel random access machines, in which concurrent read and concurrent write to a cell in the common memory are not allowed. This is done by reducing decoding to parallel finite-state automata simulation and prefix sums.
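For reference, a minimal serial sketch of the prefix-code decomposition problem itself (the codebook is a hypothetical example; the paper's contribution is performing this in parallel via automata simulation and prefix sums):

```python
# Serial sketch of prefix-code message decomposition. Because no codeword
# is a prefix of another, the first codebook match along the bit stream
# is always the correct one.
def decode_prefix(bits, codebook):
    inverse = {v: k for k, v in codebook.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:        # prefix property => greedy match is safe
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return out

code = {"a": "0", "b": "10", "c": "11"}  # a prefix code (illustrative)
print(decode_prefix("0100011", code))    # ['a', 'b', 'a', 'a', 'c']
```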

  6. Distributed Data Flow Signal Processors

    NASA Astrophysics Data System (ADS)

    Eggert, Jay A.

    1982-12-01

Near-term advances in technology such as VHSIC promise revolutionary progress in programmable signal processor capabilities. However, meeting projected signal processing requirements for radar, sonar, and other high-throughput systems requires effective multi-processor networks. This paper describes a distributed signal processor architecture currently in development at Texas Instruments that is designed to meet these high-throughput, multi-mode system requirements. The approach supports multiple, functionally specialized, autonomous nodes (processors) interconnected via a flexible, high-speed communication network. A common task scheduling mechanism based upon "data flow" concepts provides an efficient high-level programming and simulation mechanism. The Ada syntax-compatible task-level programming and simulation software support tools are also described.

  7. Fully automatic telemetry data processor

    NASA Technical Reports Server (NTRS)

    Cox, F. B.; Keipert, F. A.; Lee, R. C.

    1968-01-01

    Satellite Telemetry Automatic Reduction System /STARS 2/, a fully automatic computer-controlled telemetry data processor, maximizes data recovery, reduces turnaround time, increases flexibility, and improves operational efficiency. The system incorporates a CDC 3200 computer as its central element.

  8. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, D.B.

    1996-12-31

The present device provides a dynamically configurable communication network: a multi-processor parallel processing system having a serial communication network and a high-speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high-density data among nodes, and to monitor each slave processor's status. The high-speed parallel processing network is used to effect the transmission of high-density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high-density data arrives at or leaves the node. 6 figs.

  9. Electrostatically focused addressable field emission array chips (AFEA's) for high-speed massively parallel maskless digital E-beam direct write lithography and scanning electron microscopy

    DOEpatents

    Thomas, Clarence E.; Baylor, Larry R.; Voelkl, Edgar; Simpson, Michael L.; Paulus, Michael J.; Lowndes, Douglas H.; Whealton, John H.; Whitson, John C.; Wilgen, John B.

    2002-12-24

Systems and methods are described for addressable field emission array (AFEA) chips. A method of operating an addressable field-emission array includes: generating a plurality of electron beams from a plurality of emitters that compose the addressable field-emission array; and focusing at least one of the plurality of electron beams with an on-chip electrostatic focusing stack. The systems and methods provide advantages including the avoidance of space-charge blow-up.

  10. Acousto-optic/CCD real-time SAR data processor

    NASA Technical Reports Server (NTRS)

    Psaltis, D.

    1983-01-01

A SAR processor is presented that uses an acousto-optic device as the input electronic-to-optical transducer and a 2-D CCD image sensor operated in the time-delay-and-integrate (TDI) mode. The CCD serves as the optical detector, and it simultaneously operates as an array of optically addressed correlators. The lines of the focused SAR image form continuously (at the radar PRF) at the final row of the CCD. The principles of operation of this processor, its performance characteristics, the state of the art of the devices used, and experimental results are outlined. The methods by which this processor can be made flexible, so that it can be dynamically adapted to changing SAR geometries, are discussed.

  11. Parallel performance of the fine-grain pipeline FPGA image processing system

    NASA Astrophysics Data System (ADS)

    Gorgoń, M.

    2012-06-01

The use of FPGA circuits in imaging systems is increasing, and they now compete with other computing environments. The article describes the criteria to follow when choosing the type of image-processing computing system, taking into consideration the advantages and disadvantages of each technology: general-purpose processor, digital signal processor, graphics processing unit, application-specific integrated circuit, and field-programmable gate array. Attention is drawn to various video transmission standards. The state of research and development trends in the field of FPGA-based image processing are briefly presented. A method for defining processing performance for image processing is proposed. It is proven that for a pipeline architecture implemented in an FPGA, a linear speedup is achieved and the parallel efficiency is equal to one.
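The linear-speedup claim can be illustrated with the standard pipeline timing model (an illustrative assumption, not the article's exact analysis): p stages processing n items take n + p - 1 cycles versus n·p cycles serially, so efficiency approaches one as the stream length grows.

```python
# Standard pipeline performance model (illustrative sketch).
def pipeline_speedup(n_items, p_stages):
    serial = n_items * p_stages          # one item at a time, all stages
    pipelined = n_items + p_stages - 1   # fill latency, then one item/cycle
    speedup = serial / pipelined
    efficiency = speedup / p_stages
    return speedup, efficiency

# For a long video stream the efficiency approaches 1 (linear speedup):
s, e = pipeline_speedup(n_items=1_000_000, p_stages=32)
print(s, e)
```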

  12. Accelerating the performance of a novel meshless method based on collocation with radial basis functions by employing a graphical processing unit as a parallel coprocessor

    NASA Astrophysics Data System (ADS)

    Owusu-Banson, Derek

In recent times, a variety of industries, applications and numerical methods, including the meshless method, have enjoyed a great deal of success by utilizing the graphical processing unit (GPU) as a parallel coprocessor. These benefits often include performance improvement over previous implementations. Furthermore, applications running on graphics processors enjoy superior performance per dollar and performance per watt compared with implementations built exclusively on traditional central processing technologies. The GPU was originally designed for graphics acceleration, but the modern GPU, known as the General Purpose Graphical Processing Unit (GPGPU), can be used for scientific and engineering calculations. The GPGPU consists of a massively parallel array of integer and floating point processors. There are typically hundreds of processors per graphics card with dedicated high-speed memory. This work describes an application written by the author, titled GaussianRBF, to show the implementation and results of a novel meshless method that incorporates the collocation of the Gaussian radial basis function by utilizing the GPU as a parallel co-processor. Key phases of the proposed meshless method have been executed on the GPU using the NVIDIA CUDA software development kit. In particular, the matrix fill and solution phases have been carried out on the GPU, along with some post-processing. This approach resulted in a decreased processing time compared to a similar algorithm implemented on the CPU while maintaining the same accuracy.
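The matrix-fill and solution phases mentioned above can be sketched for plain 1D interpolation on the CPU (a minimal sketch with an assumed shape parameter; the thesis targets a GPU and a PDE collocation setting):

```python
import numpy as np

# Minimal 1D sketch of collocation with Gaussian radial basis functions:
# fill A[i, j] = exp(-(shape * |x_i - x_j|)^2), solve for coefficients,
# then evaluate the expansion at new points.
def gaussian_rbf_interpolate(x_nodes, f_nodes, x_eval, shape=5.0):
    r = x_nodes[:, None] - x_nodes[None, :]
    A = np.exp(-(shape * r) ** 2)            # matrix-fill phase
    c = np.linalg.solve(A, f_nodes)          # solution phase
    re = x_eval[:, None] - x_nodes[None, :]
    return np.exp(-(shape * re) ** 2) @ c    # evaluation phase

x = np.linspace(0.0, 1.0, 15)
approx = gaussian_rbf_interpolate(x, np.sin(2 * np.pi * x), np.array([0.25]))
print(float(approx[0]))  # close to sin(pi/2) = 1
```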

  13. A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC (Superconducting Super Collider) detectors

    SciTech Connect

    Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C. ); Lockyer, N.; VanBerg, R. )

    1989-12-01

A new era of high-energy physics research is beginning, requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders-of-magnitude higher data rates from the detector and online processing power well beyond the capabilities of current high energy physics data acquisition systems are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of gigabytes per second from the detector and into an array of online processors (i.e., a processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported, high-level-language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab.

  14. Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming

    NASA Technical Reports Server (NTRS)

    Gentzsch, W.

    1982-01-01

Problems which can arise with vector and parallel computers are discussed in a user-oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

  15. Interstitial fault tolerance-a technique for making systolic arrays fault tolerant

    SciTech Connect

    Kuhn, R.H.

    1983-01-01

Systolic arrays are a popular model for the implementation of highly parallel VLSI systems. In this paper interstitial fault tolerance (IFT), a technique for incorporating fault tolerance into systolic arrays in a natural manner, is discussed. IFT can be used for reliable computation or for yield enhancement. Previous fault tolerance techniques for reliable computation on SIMD systems have employed redundant hardware; IFT, on the other hand, employs time redundancy. Previous wafer-scale integration techniques for yield enhancement have been proposed only for linear processing-element arrays; IFT is effective for both linear and two-dimensional arrays. The time redundancy needed to achieve IFT is shown to be bounded by a factor of 3, with no processor redundancy. Results of Monte Carlo simulation of IFT are presented. 19 references.

  16. Benchmarking NWP Kernels on Multi- and Many-core Processors

    NASA Astrophysics Data System (ADS)

    Michalakes, J.; Vachharajani, M.

    2008-12-01

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc.; (2) enumerate and classify effective strategies for coding and optimizing for these new processors; (3) assess difficulties and opportunities for tool or higher-level language support; and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare the effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
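Characterizing kernels by computational intensity and memory bandwidth pressure, as in goal (1), is often done with a roofline-style estimate; a toy sketch with illustrative (not measured) hardware numbers:

```python
# Roofline-style back-of-envelope: attainable throughput is the lesser of
# the machine's peak compute rate and what memory bandwidth can feed at
# the kernel's arithmetic intensity (flops per byte). Numbers below are
# illustrative, not measurements from the WRF study.
def attainable_gflops(intensity, peak_gflops, peak_gb_per_s):
    return min(peak_gflops, intensity * peak_gb_per_s)

# A low-intensity stencil kernel is bandwidth-bound:
print(attainable_gflops(intensity=0.5, peak_gflops=100.0, peak_gb_per_s=30.0))  # 15.0
# A high-intensity kernel is compute-bound:
print(attainable_gflops(intensity=10.0, peak_gflops=100.0, peak_gb_per_s=30.0))  # 100.0
```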

  17. ELIPS: Toward a Sensor Fusion Processor on a Chip

    NASA Technical Reports Server (NTRS)

    Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James

    1998-01-01

The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor data in compact, low-power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems, are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data sets, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and demonstrated microsecond processing times.

  18. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    PubMed Central

    Sharma, Anuj; Manolakos, Elias S.

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  19. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    PubMed

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332

  20. PVM Enhancement for Beowulf Multiple-Processor Nodes

    NASA Technical Reports Server (NTRS)

    Springer, Paul

    2006-01-01

    A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.

  1. Simultaneous multithreaded processor enhanced for multimedia applications

    NASA Astrophysics Data System (ADS)

    Mombers, Friederich; Thomas, Michel

    1999-12-01

The paper proposes a new media processor architecture specifically designed to handle state-of-the-art multimedia encoding and decoding tasks. To achieve this, the architecture efficiently exploits data-, instruction- and thread-level parallelism while continuously adapting its computational resources to reach the most appropriate parallelism level among all the concurrent encoding/decoding processes. Looking at the implementation constraints, several critical choices were adopted that solve the interconnection delay problem, lower the cache-miss and pipeline-stall effects, and reduce register file and memory size by adopting a clustered Simultaneous Multithreaded Architecture. We enhanced the classic model to exploit both instruction- and data-level parallelism through vector instructions. The vector extension is well justified for multimedia workloads and improves code density, crossbar complexity, register file ports and decoding logic area, while it still provides an efficient way to fully exploit a large set of functional units. An MPEG-2 encoding algorithm based on hybrid genetic search has been implemented that shows the efficiency of the architecture in adapting its resource allocation to better fulfill the application requirements.

  2. Single-Point Access to Data Distributed on Many Processors

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

The functions and data structures are described that would be necessary to implement the Chapel concepts of distributions, domains, allocation, access, and interfaces to the compiler for transformations from Chapel source to their run-time implementation. A complete set of object-oriented operators is defined that enables one to access elements of a distributed array through regular arithmetic index sets, giving the programmer the illusion that all the elements are collocated on a single processor. This means that arbitrary regions of the arrays can be fragmented and distributed across multiple processors with a single point of access. This is important because it can significantly improve programmer productivity by allowing programmers to concentrate on the high-level details of the algorithm without worrying about the efficiency and communication details of the underlying representation.
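The single-point-access idea can be sketched with a toy block-distributed array; the class and method names below are invented for illustration and are not the Chapel runtime interface:

```python
# Toy sketch: a block-distributed array whose indexing operator hides
# which "processor" (here, a plain list standing in for per-node storage)
# owns each element, giving the illusion of a single local array.
class BlockDistributedArray:
    def __init__(self, n, num_procs):
        self.block = -(-n // num_procs)  # ceil division: block size per node
        self.shards = [[0] * min(self.block, n - p * self.block)
                       for p in range(num_procs) if p * self.block < n]

    def _locate(self, i):
        return divmod(i, self.block)     # (owner processor, local offset)

    def __getitem__(self, i):
        p, off = self._locate(i)
        return self.shards[p][off]

    def __setitem__(self, i, v):
        p, off = self._locate(i)
        self.shards[p][off] = v

a = BlockDistributedArray(10, num_procs=3)   # shards of size 4, 4, 2
a[5] = 42                                    # lands on processor 1, offset 1
print(a[5], a._locate(5))                    # 42 (1, 1)
```

In a real runtime, `_locate` would trigger communication when the owner is a remote node; the programmer-facing indexing syntax stays the same either way.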

  3. Parallel processing architecture for H.264 deblocking filter on multi-core platforms

    NASA Astrophysics Data System (ADS)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

    2012-03-01

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions provide not only exceptional quality but also efficiency, low power, and low latency, previously unattainable in software-based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in the H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color subsampling patterns such as YUV 4:2:2 or 4:4:4 formats. Low-power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking

  4. Proposed MIDAS II processing array

    SciTech Connect

    Meng, J.

    1982-03-01

    MIDAS (Modular Interactive Data Analysis System) is a ganged processor scheme used to interactively process large data bases occurring as a finite sequence of similar events. The existing device uses a system of eight ganged minicomputer central processor boards servicing a rotating group of 16 memory blocks. A proposal for MIDAS II, the successor to MIDAS, is to use a much larger number of ganged processors, one per memory block, avoiding the necessity of switching memories from processor to processor. To be economic, MIDAS II must use a small, relatively fast and inexpensive microprocessor, such as the TMS 9995. This paper analyzes the use of the TMS 9995 applied to the MIDAS II processing array, emphasizing computational, architectural and physical characteristics which make the use of the TMS 9995 attractive for this application.

  5. Network control processor for a TDMA system

    NASA Astrophysics Data System (ADS)

    Suryadevara, Omkarmurthy; Debettencourt, Thomas J.; Shulman, R. B.

    Two unique aspects of designing a network control processor (NCP) to monitor and control a demand-assigned, time-division multiple-access (TDMA) network are described. The first involves the implementation of redundancy by synchronizing the databases of two geographically remote NCPs. The two sets of databases are kept in synchronization by collecting data on both systems, transferring databases, sending incremental updates, and the parallel updating of databases. A periodic audit compares the checksums of the databases to ensure synchronization. The second aspect involves the use of a tracking algorithm to dynamically reallocate TDMA frame space. This algorithm detects and tracks current and long-term load changes in the network. When some portions of the network are overloaded while others have excess capacity, the algorithm automatically calculates and implements a new burst time plan.
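The periodic checksum audit described above can be sketched as follows (illustrative only; the actual NCP record format and synchronization protocol are not specified in the abstract):

```python
import hashlib

# Sketch of the periodic audit: each NCP computes a checksum over a
# canonical serialization of its database, and the two checksums are
# compared to confirm the remote copies remain in synchronization.
def db_checksum(records):
    h = hashlib.sha256()
    for key in sorted(records):                  # canonical ordering
        h.update(f"{key}={records[key]}".encode())
    return h.hexdigest()

primary = {"link1": "up", "link2": "down"}
backup = dict(primary)                           # kept in sync by updates
print(db_checksum(primary) == db_checksum(backup))  # True
```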

  6. Parallel supercomputing today and the cedar approach.

    PubMed

    Kuck, D J; Davidson, E S; Lawrie, D H; Sameh, A H

    1986-02-28

    More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical; they will require even more advanced software. Compilers that restructure user programs to exploit the machine organization seem to be essential. A wide range of algorithms and applications is being developed in an effort to provide high parallel processing performance in many fields. The Cedar supercomputer, presently operating with eight processors in parallel, uses advanced system and applications software developed at the University of Illinois during the past 12 years. This software should allow the number of processors in Cedar to be doubled annually, providing rapid performance advances in the next decade. PMID:17740294

  7. A Parallel Vector Machine for the PM Programming Language

    NASA Astrophysics Data System (ADS)

    Bellerby, Tim

    2016-04-01

    PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using
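The masked vector execution described above can be illustrated with array masks; this is a sketch of the general technique using NumPy, not PM's actual instruction set:

```python
import numpy as np

# A conditional is compiled into mask computation plus masked vector
# operations, rather than per-element branching: the if-branch runs under
# the mask, the else-branch under its inverse.
def conditional_sqrt(x):
    mask = x >= 0                       # Boolean mask from the condition
    out = np.empty_like(x, dtype=float)
    out[mask] = np.sqrt(x[mask])        # if-branch, masked
    out[~mask] = 0.0                    # else-branch, inverted mask
    return out

print(conditional_sqrt(np.array([4.0, -1.0, 9.0])))  # [2. 0. 3.]
```

When the mask becomes sparse, an implementation can switch to a location list (e.g. `np.flatnonzero(mask)`) and gather/scatter only the active cells, mirroring the Boolean-to-location-list shift the abstract describes.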

  8. Parallel distributed free-space optoelectronic computer engine using flat plug-on-top optics package

    NASA Astrophysics Data System (ADS)

    Berger, Christoph; Ekman, Jeremy T.; Wang, Xiaoqing; Marchand, Philippe J.; Spaanenburg, Henk; Kiamilev, Fouad E.; Esener, Sadik C.

    2000-05-01

We report on ongoing work on a free-space optical interconnect system that will demonstrate a fast Fourier transform calculation distributed among six processor chips. Logically, the processors are arranged in two linear chains, where each element communicates optically with its nearest neighbors. Physically, the setup consists of a large motherboard; several multi-chip carrier modules, which hold the processor/driver chips and the optoelectronic chips (arrays of lasers and detectors); and several plug-on-top optics modules, which provide the optical links between the chip carrier modules. The system design tries to satisfy numerous constraints, such as compact size, potential for mass production, suitability for large arrays (up to 1024 parallel channels), compatibility with standard electronics fabrication and packaging technology, potential for active misalignment compensation by integrating MEMS technology, and suitability for testing different imaging topologies. We present the system architecture together with details of key components and modules, and report on first experiences with prototype modules of the setup.

  9. A Parallel Processing Algorithm for Gravity Inversion

    NASA Astrophysics Data System (ADS)

    Frasheri, Neki; Bushati, Salvatore; Frasheri, Alfred

    2013-04-01

    -bodies geosections, it was concluded that limitation of weighted least squares error gave better results in all cases, at the range of 3% - 6%. The typical used geosection was 4000m*4000m*2000m discretized with 11x11x6, 21x21x11 and 41x41x21 of 3D nodes. Bodies were represented by vertical prisms with section 400m*400m and different heights. The run-time of the single body geosection resulted up to several hours for a single processor computer for the geosection with 41x41x21 nodes. Parallel processing with OpenMP and MPI was used for geosections of 81x81x41 nodes (using finite cuboid elements with edge size 50m) in parallel systems of Bulgarian Academy of Sciences and of Super Computing Center of NIIFI in Hungary. Using up to 1,000 processors the run-time resulted about 24 hours, and it was evaluated that for a 3D array of 161x161x81 nodes (cuboids with edge 25m) the run time in 1,000 cores would be up to one year. The quality of inverted geosections resulted good in case of single body models, the algorithm offered clear contrast between the mass density of the body and the environment, and the shapes of original and inverted prisms resulted quite similar. In two body cases better solutions were obtained for shallow bodies, with the depth the tendency of the algorithm was to delineate only the shallow tops of prisms and compensate with a single mass at the depth. The algorithm was tested also with two real cases of typical gravity anomalies observed in Albanides.

  10. Integrated fuel processor development challenges.

    SciTech Connect

    Ahmed, S.; Pereira, Lee, S. H. D.; Kaun, T.; Krumpelt, M.

    2002-01-09

    In the absence of a hydrogen-refueling infrastructure, the success of the fuel cell system in the market will depend on fuel processors to enable the use of available fuels, such as gasoline, natural gas, etc. The fuel processor includes several catalytic reactors, scrubbers to remove chemical species that can poison downstream catalysts or the fuel cell electrocatalyst, and heat exchangers. Most fuel cell power applications seek compact, lightweight hardware with rapid-start and load-following capabilities. Although packaging can partially address the size and volume, balancing the performance parameters while maintaining the fuel conversion (to hydrogen) efficiency requires careful integration of the unit operations and processes. Argonne National Laboratory has developed integrated fuel processors that are compact and light, and that operate efficiently. This paper discusses some of the difficulties encountered in the development process, focusing on the factors/components that constrain performance, and areas that need further research and development.

  11. Sequence information signal processor for local and global string comparisons

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1997-01-01

    A sequence information signal processing integrated circuit chip designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16 bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.
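The recurrence that each processing element of such a systolic array evaluates is the Smith-Waterman local-alignment score. A minimal serial sketch of the scoring recurrence (not the patented hardware; the weight values here are illustrative assumptions) is:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between sequences a and b.

    Serial form of the dynamic-programming recurrence; in the systolic
    array, one processing element handles one symbol of one sequence and
    the anti-diagonals of H are computed in parallel.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                     # local alignment may restart
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # deletion
                          H[i][j - 1] + gap)     # insertion
            best = max(best, H[i][j])
    return best

print(smith_waterman("AAA", "AAA"))  # 6: three matches at weight 2
```

The patent's per-processor "similarity measure device" corresponds to making the match/mismatch weights independently settable per element.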

  12. Scalable Parallel Algebraic Multigrid Solvers

    SciTech Connect

    Bank, R; Lu, S; Tong, C; Vassilevski, P

    2005-03-23

    The authors propose a parallel algebraic multilevel algorithm (AMG), which has the novel feature that the subproblem residing in each processor is defined over the entire partition domain, although the vast majority of unknowns for each subproblem are associated with the partition owned by the corresponding processor. This feature ensures that a global coarse description of the problem is contained within each of the subproblems. The advantages of this approach are that interprocessor communication is minimized in the solution process while an optimal order of convergence rate is preserved; and the speed of local subproblem solvers can be maximized using the best existing sequential algebraic solvers.

  13. MBASIC batch processor architectural overview

    NASA Technical Reports Server (NTRS)

    Reynolds, S. M.

    1978-01-01

    The MBASIC (TM) batch processor, a language translator designed to operate in the MBASIC (TM) environment, is described. Features include: (1) a CONVERT TO BATCH command, usable from the ready mode; and (2) translation of the user's program in stages through several levels of intermediate language and optimization. The processor is to be designed and implemented in both machine-independent and machine-dependent sections. The architecture is planned so that optimization processes are transparent to the rest of the system and need not be included in the first design implementation cycle.

  14. Parallel first-order linear recurrence solver

    SciTech Connect

    Meyer, G.G.L.; Podrazik, L.

    1987-04-01

    In this paper the authors present a parallel procedure for the solution of first-order linear recurrence systems of size N when the number of processors p is small relative to N. They show that when 1 < p^2 <= N, a first-order linear recurrence system of size N can be solved in 5(N - 1)/(p + 1) steps on a p-processor SIMD machine and in at most 5(N - 1/2)/(p + 3/2) steps on a p-processor MIMD machine. As a special case, they show that their approach precisely achieves the lower bound 2(N - 1)/(p + 1) for solving the parallel prefix problem on a p-processor machine.
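The idea behind such solvers is that the affine maps x -> a*x + b compose associatively, so contiguous blocks of the recurrence can be reduced independently and only a short serial scan couples the blocks. The following is a serial simulation of that blocked scheme under our own naming, not the authors' exact procedure:

```python
from functools import reduce

def compose(f, g):
    """Compose affine maps (a, b) meaning x -> a*x + b: apply g first, then f."""
    return (f[0] * g[0], f[0] * g[1] + f[1])

def solve_recurrence(a, b, x0, p=4):
    """Solve x[i] = a[i]*x[i-1] + b[i] by chunking among p simulated processors."""
    n = len(a)
    size = -(-n // p)                       # ceil(n / p) entries per processor
    blocks = [list(zip(a[k:k + size], b[k:k + size])) for k in range(0, n, size)]
    # Phase 1 (parallel): each block folds its maps into one summary map.
    summaries = [reduce(lambda acc, m: compose(m, acc), blk, (1.0, 0.0))
                 for blk in blocks]
    # Phase 2 (serial, length p): scan summaries to get each block's entry value.
    entries, x = [], x0
    for s in summaries:
        entries.append(x)
        x = s[0] * x + s[1]
    # Phase 3 (parallel): each block expands locally from its entry value.
    out = []
    for blk, xin in zip(blocks, entries):
        for ai, bi in blk:
            xin = ai * xin + bi
            out.append(xin)
    return out
```

The two parallel phases each do O(N/p) work per processor and the serial scan does O(p) work, which is why such schemes pay off only when p is small relative to N, as the abstract assumes.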

  15. Parallel 3-D method of characteristics in MPACT

    SciTech Connect

    Kochunas, B.; Downar, T. J.; Liu, Z.

    2013-07-01

    A new parallel 3-D MOC kernel has been developed and implemented in MPACT which makes use of the modular ray tracing technique to reduce computational requirements and to facilitate parallel decomposition. The parallel model makes use of both distributed and shared memory parallelism which are implemented with the MPI and OpenMP standards, respectively. The kernel is capable of parallel decomposition of problems in space, angle, and by characteristic rays up to O(10^4) processors. Initial verification of the parallel 3-D MOC kernel was performed using the Takeda 3-D transport benchmark problems. The eigenvalues computed by MPACT are within the statistical uncertainty of the benchmark reference and agree well with the averages of other participants. The MPACT k{sub eff} differs from the benchmark results for rodded and un-rodded cases by 11 and -40 pcm, respectively. The calculations were performed for various numbers of processors and parallel decompositions up to 15625 processors; all producing the same result at convergence. The parallel efficiency of the worst case was 60%, while very good efficiency (>95%) was observed for cases using 500 processors. The overall run time for the 500 processor case was 231 seconds and 19 seconds for the case with 15625 processors. Ongoing work is focused on developing theoretical performance models and the implementation of acceleration techniques to minimize the number of iterations to converge. (authors)

  16. An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors

    NASA Technical Reports Server (NTRS)

    Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

    2015-01-01

    This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches", at convenient joints called "connection points", and uses an Order-N (O(N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.

  17. Parallel implementation of an algorithm for Delaunay triangulation

    NASA Technical Reports Server (NTRS)

    Merriam, Marshal L.

    1992-01-01

    The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

  18. Parallel implementation of an algorithm for Delaunay triangulation

    NASA Technical Reports Server (NTRS)

    Merriam, Marshall L.

    1992-01-01

    This work concerns the theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer. Tanemura's algorithm does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a conventional, vector processing, supercomputer is problematic. Efficient implementation on a parallel architecture is possible, however. In this work, speeds in excess of 8 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

  19. Portable parallel programming in a Fortran environment

    SciTech Connect

    May, E.N.

    1989-01-01

    Experience using the Argonne-developed PARMACS macro package to implement a portable parallel programming environment is described. Fortran programs with intrinsic parallelism of coarse and medium granularity are easily converted to parallel programs which are portable among a number of commercially available parallel processors in the class of shared-memory bus-based and local-memory network-based MIMD processors. The parallelism is implemented using standard UNIX (tm) tools and a small number of easily understood synchronization concepts (monitors and message-passing techniques) to construct and coordinate multiple cooperating processes on one or many processors. Benchmark results are presented for parallel computers such as the Alliant FX/8, the Encore MultiMax, the Sequent Balance, the Intel iPSC/2 Hypercube and a network of Sun 3 workstations. These parallel machines are typical MIMD types with from 8 to 30 processors, each rated at from 1 to 10 MIPS processing power. The demonstration code used for this work is a Monte Carlo simulation of the response to photons of a ''nearly realistic'' lead, iron and plastic electromagnetic and hadronic calorimeter, using the EGS4 code system. 6 refs., 2 figs., 2 tabs.

  20. A generic fine-grained parallel C

    NASA Technical Reports Server (NTRS)

    Hamet, L.; Dorband, John E.

    1988-01-01

    With the present availability of parallel processors of vastly different architectures, there is a need for a common language interface to multiple types of machines. The parallel C compiler, currently under development, is intended to be such a language. This language is based on the belief that an algorithm designed around fine-grained parallelism can be mapped relatively easily to different parallel architectures, since a large percentage of the parallelism has been identified. The compiler generates a FORTH-like machine-independent intermediate code. A machine-dependent translator will reside on each machine to generate the appropriate executable code, taking advantage of the particular architectures. The goal of this project is to allow a user to run the same program on such machines as the Massively Parallel Processor, the CRAY, the Connection Machine, and the CYBER 205 as well as serial machines such as VAXes, Macintoshes and Sun workstations.

  1. Parallelization of the Pipelined Thomas Algorithm

    NASA Technical Reports Server (NTRS)

    Povitsky, A.

    1998-01-01

    In this study the following questions are addressed. Is it possible to improve the parallelization efficiency of the Thomas algorithm? How should the Thomas algorithm be formulated in order to get solved lines that are used as data for other computational tasks while processors are idle? To answer these questions, two-step pipelined algorithms (PAs) are introduced formally. It is shown that the idle processor time is invariant with respect to the order of backward and forward steps in PAs starting from one outermost processor. The advantage of PAs starting from two outermost processors is small. Versions of the pipelined Thomas algorithms considered here fall into the category of PAs. These results show that the parallelization efficiency of the Thomas algorithm cannot be improved directly. However, the processor idle time can be used if some data has been computed by the time processors become idle. To achieve this goal the Immediate Backward pipelined Thomas Algorithm (IB-PTA) is developed in this article. The backward step is computed immediately after the forward step has been completed for the first portion of lines. This enables the completion of the Thomas algorithm for some of these lines before processors become idle. An algorithm for generating a static processor schedule recursively is developed. This schedule is used to switch between forward and backward computations and to control communications between processors. The advantage of the IB-PTA over the basic PTA is the presence of solved lines, which are available for other computations, by the time processors become idle.
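For reference, the serial Thomas algorithm that these pipelined variants distribute across processors is the standard forward-elimination / back-substitution pair for tridiagonal systems. A minimal serial sketch (not the article's pipelined code) is:

```python
def thomas(a, b, c, d):
    """Serial Thomas algorithm for a tridiagonal system A x = d.

    a: sub-diagonal (a[0] unused), b: main diagonal, c: super-diagonal
    (c[-1] unused), d: right-hand side. All lists of length n.
    """
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):               # forward elimination sweep
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):      # back substitution sweep
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The article's concern is precisely the data dependence visible here: both sweeps traverse the line strictly in order, so when the line is partitioned across processors, each processor must wait for its neighbor, and the pipelined variants reorder forward and backward steps over many lines to keep processors busy.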

  2. A water-resistant speech processor.

    PubMed

    Gibson, Peter; Capcelea, Edmond; Darley, Ian; Leavens, Jason; Parker, John

    2006-09-01

    Cochlear implant systems are used in diverse environments and should function during work, exercise and play as people go about their daily lives. This is a demanding requirement, with exposure to liquid and other contaminant ingress from many sources. For reliability, it is desirable that the speech processor withstands these exposures. This design challenge has been addressed in the Nucleus(R) Freedom(TM) speech processor. The Nucleus Freedom speech processor complies with International Standard IEC 60529, as independently certified. Tests include spraying the processor with water followed by immediate verification of functionality including microphone response, radio frequency link and processor controls. The processor has met level IP44 of the Standard.

  3. Implementation and use of systolic array processors

    SciTech Connect

    Kung, H.T.

    1983-01-01

    Major efforts are now underway to use systolic array processors in large, real-life applications. The author examines various implementation issues and alternatives, the latter from the viewpoints of flexibility and interconnection topologies. He then identifies some work that is essential to the eventual wide use of systolic array processors, such as the development of building blocks, system support and suitable algorithms. 24 references.

  4. Bipartite memory network architectures for parallel processing

    SciTech Connect

    Smith, W.; Kale, L.V. . Dept. of Computer Science)

    1990-01-01

    Parallel architectures are broadly classified as either shared memory or distributed memory architectures. In this paper, the authors propose a third family of architectures, called bipartite memory network architectures. In this architecture, processors and memory modules constitute a bipartite graph, where each processor is allowed to access a small subset of the memory modules, and each memory module allows access from a small set of processors. The architecture is particularly suitable for computations requiring dynamic load balancing. The authors explore the properties of this architecture by examining the Perfect Difference set based topology for the graph. Extensions of this topology are also suggested.

  5. Scalable load balancing for massively parallel distributed Monte Carlo particle transport

    SciTech Connect

    O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

    2013-07-01

    In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
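The iterated pair-wise idea can be illustrated with a hypercube-style pairing, in which round k pairs processor i with processor i XOR 2^k; after log2(N) rounds of averaging, every (divisible) load equals the global mean. The pairing pattern and names below are our illustrative assumptions, not the authors' algorithm:

```python
def balance(work):
    """Hypercube-style pairwise load balancing (serial simulation).

    In round r, processor i averages its load with partner i XOR 2^r.
    After log2(N) rounds all loads equal the global mean, assuming work
    is arbitrarily divisible. N = len(work) must be a power of two.
    """
    n = len(work)
    w = list(work)
    rounds = n.bit_length() - 1            # log2(N) pairwise rounds
    for r in range(rounds):
        step = 1 << r
        for i in range(n):
            j = i ^ step
            if i < j:                      # each pair balances once per round
                avg = (w[i] + w[j]) / 2.0
                w[i] = w[j] = avg
    return w

print(balance([8, 0, 4, 4]))
```

Each processor talks to exactly one partner per round, so the run time grows with the number of rounds, O(log(N)), rather than with a global inspection of all N workloads.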

  6. SIMD-parallel understanding of natural language with application to magnitude-only optical parsing of text

    NASA Astrophysics Data System (ADS)

    Schmalz, Mark S.

    1992-08-01

    A novel parallel model of natural language (NL) understanding is presented which can realize high levels of semantic abstraction, and is designed for implementation on synchronous SIMD architectures and optical processors. Theory is expressed in terms of the Image Algebra (IA), a rigorous, concise, inherently parallel notation which unifies the design, analysis, and implementation of image processing algorithms. The IA has been implemented on numerous parallel architectures, and IA preprocessors and interpreters are available for the FORTRAN and Ada languages. In a previous study, we demonstrated the utility of IA for mapping MEA-conformable (Multiple Execution Array) algorithms to optical architectures. In this study, we extend our previous theory to map serial parsing algorithms to the synchronous SIMD paradigm. We initially derive a two-dimensional image that is based upon the adjacency matrix of a semantic graph. Via IA template mappings, the operations of bottom-up parsing, semantic disambiguation, and referential resolution are implemented as image-processing operations upon the adjacency matrix. Pixel-level operations are constrained to Hadamard addition and multiplication, thresholding, and row/column summation, which are available in magnitude-only optics. Assuming high parallelism in the parse rule base, the parsing of n input symbols with a grammar consisting of M rules of arity H, on an N-processor architecture, could exhibit a time complexity T(n) whose computational cost is constant and of order H. Since H << n is typical, we claim a fundamental complexity advantage over the current O(n) theoretical time limit of MIMD parsing architectures. Additionally, we show that inference over a semantic net is achievable in parallel in O(m) time, where m corresponds to the depth of the search tree. Results are evaluated in terms of computational cost on SISD and SIMD processors.

  7. Time Data Sequential Processor /TDSP/

    NASA Technical Reports Server (NTRS)

    Joseph, A. E.; Pavlovitch, T.; Roth, R. Y.; Sturms, F. M., Jr.

    1970-01-01

    Time Data Sequential Processor /TDSP/ computer program provides preflight predictions for lunar trajectories from injection to impact, and for planetary escape trajectories for up to 100 hours from launch. One of the major options TDSP performs is the determination of tracking station view periods.

  8. Processor Emulator with Benchmark Applications

    SciTech Connect

    Lloyd, G. Scott; Pearce, Roger; Gokhale, Maya

    2015-11-13

    A processor emulator and a suite of benchmark applications have been developed to assist in characterizing the performance of data-centric workloads on current and future computer architectures. Some of the applications have been collected from other open source projects. For more details on the emulator and an example of its usage, see reference [1].

  9. A Course on Reconfigurable Processors

    ERIC Educational Resources Information Center

    Shoufan, Abdulhadi; Huss, Sorin A.

    2010-01-01

    Reconfigurable computing is an established field in computer science. Teaching this field to computer science students demands special attention due to limited student experience in electronics and digital system design. This article presents a compact course on reconfigurable processors, which was offered at the Technische Universitat Darmstadt,…

  10. Children, Word Processors and Genre.

    ERIC Educational Resources Information Center

    Bowman, Marcus

    1999-01-01

    Students aged 11-12, placed in groups of four, used word processors to write about a dramatized event using persuasive, newspaper, recount, or report styles. Students' talk as they engaged in the task was analyzed to illuminate the linguistic and cognitive processes involved in group construction of text. (Author/SV)

  11. Compact hohlraum configuration with parallel planar-wire-array x-ray sources at the 1.7-MA Zebra generator.

    PubMed

    Kantsyrev, V L; Chuvatin, A S; Rudakov, L I; Velikovich, A L; Shrestha, I K; Esaulov, A A; Safronova, A S; Shlyaptseva, V V; Osborne, G C; Astanovitsky, A L; Weller, M E; Stafford, A; Schultz, K A; Cooper, M C; Cuneo, M E; Jones, B; Vesey, R A

    2014-12-01

    A compact Z-pinch x-ray hohlraum design with parallel-driven x-ray sources is experimentally demonstrated in a configuration with a central target and tailored shine shields at a 1.7-MA Zebra generator. Driving in parallel two magnetically decoupled compact double-planar-wire Z pinches has demonstrated the generation of synchronized x-ray bursts that correlated well in time with x-ray emission from a central reemission target. Good agreement between simulated and measured hohlraum radiation temperature of the central target is shown. The advantages of compact hohlraum design applications for multi-MA facilities are discussed.

  12. Parallelized direct execution simulation of message-passing parallel programs

    NASA Technical Reports Server (NTRS)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

    1994-01-01

    As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  13. Parallel processing data network of master and slave transputers controlled by a serial control network

    DOEpatents

    Crosetto, Dario B.

    1996-01-01

    The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

  14. Optimal processor allocation for sort-last compositing under BSP-tree ordering

    NASA Astrophysics Data System (ADS)

    Ramakrishnan, C. R.; Silva, Claudio T.

    1999-03-01

    In this paper, we consider a parallel rendering model that exploits the fundamental distinction between rendering and compositing operations, by assigning processors from specialized pools for each of these operations. Our motivation is to support the parallelization of general scan-line rendering algorithms with minimal effort, basically by supporting a compositing back-end (i.e., a sort-last architecture) that is able to perform user-controlled image composition. Our computational model is based on organizing rendering as well as compositing processors on a BSP-tree, whose internal nodes we call the compositing tree. Many known rendering algorithms, such as volumetric ray casting and polygon rendering can be easily parallelized based on the structure of the BSP-tree. In such a framework, it is paramount to minimize the processing power devoted to compositing, by minimizing the number of processors allocated for composition as well as optimizing the individual compositing operations. In this paper, we address the problems related to the static allocation of processor resources to the compositing tree. In particular, we present an optimal algorithm to allocate compositing operations to compositing processors. We also present techniques to evaluate the compositing operations within each processor using minimum memory while promoting concurrency between computation and communication. We describe the implementation details and provide experimental evidence of the validity of our techniques in practice.
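The compositing operation evaluated at each internal node of such a BSP tree is typically the associative "over" operator on premultiplied RGBA values, applied per pixel in the front-to-back order the tree encodes. The sketch below is our per-pixel illustration of that operation, not the paper's processor-allocation algorithm:

```python
def over(front, back):
    """Porter-Duff 'over' on premultiplied (r, g, b, alpha) tuples."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    t = 1.0 - fa                        # transmittance of the front layer
    return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

def composite(node):
    """Evaluate a BSP compositing tree bottom-up for one pixel.

    Internal nodes are (front, back) pairs in view order; leaves are
    premultiplied RGBA 4-tuples produced by the rendering processors.
    """
    if len(node) == 2:                  # internal node: recurse, then blend
        return over(composite(node[0]), composite(node[1]))
    return node                         # leaf: a rendered pixel

# A half-transparent red layer in front of an opaque blue layer
print(composite(((0.5, 0.0, 0.0, 0.5), (0.0, 0.0, 1.0, 1.0))))
```

Because "over" is associative but not commutative, the compositing processors may regroup the tree for load balance, but must preserve the BSP view order, which is exactly the constraint the paper's allocation algorithm works under.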

  15. Parallel network simulations with NEURON.

    PubMed

    Migliore, M; Cannia, C; Lytton, W W; Markram, Henry; Hines, M L

    2006-10-01

    The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2,000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored.

  16. The application of systolic arrays to radar signal processing

    NASA Astrophysics Data System (ADS)

    Spearman, R.; Spracklen, C. T.; Miles, J. H.

    The design of a systolic array processor radar system is examined, and its performance is compared to that of a conventional radar processor. It is shown how systolic arrays can be used to replace the boards of high speed logic normally associated with a high performance radar and to implement all of the normal processing functions associated with such a system. Multifunctional systolic arrays are presented that have the flexibility associated with a general purpose digital processor but the speed associated with fixed function logic arrays.

  17. 7 CFR 1208.18 - Processor.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE PROCESSED RASPBERRY PROMOTION, RESEARCH, AND INFORMATION ORDER Processed Raspberry Promotion, Research, and Information Order Definitions § 1208.18 Processor. Processor means a person engaged in the preparation of raspberries for...

  18. 7 CFR 1208.18 - Processor.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE PROCESSED RASPBERRY PROMOTION, RESEARCH, AND INFORMATION ORDER Processed Raspberry Promotion, Research, and Information Order Definitions § 1208.18 Processor. Processor means a person engaged in the preparation of raspberries for...

  19. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    NASA Astrophysics Data System (ADS)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the development of international wireless communication standards, there is an increase in the computational requirements for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to their high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA approaches. In addition, large numbers of parallel data are processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architectures in SDR platforms. So a new way must be found to prototype SDR processors efficiently. In this paper we present a bit- and cycle-accurate model of a programmable SIMD SDR processor in the machine description language LISA. LISA is a language for instruction set architectures that enables rapid modeling at the architectural level. In order to evaluate the capabilities of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication, have been mapped onto the SDR platform. Analytical results showed that the SDR processor achieved up to a 47.1% performance boost relative to the reference processor.

  20. Parallelization of Edge Detection Algorithm using MPI on Beowulf Cluster

    NASA Astrophysics Data System (ADS)

    Haron, Nazleeni; Amir, Ruzaini; Aziz, Izzatdin A.; Jung, Low Tan; Shukri, Siti Rohkmah

    In this paper, we present the design of a parallel Sobel edge detection algorithm using Foster's methodology. The parallel algorithm is implemented using the MPI message passing library and a master/slave algorithm. Every processor performs the same sequential algorithm but on a different part of the image. Experimental results conducted on a Beowulf cluster are presented to demonstrate the performance of the parallel algorithm.
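The master/slave decomposition described above can be mimicked serially: each band of rows (plus one ghost row per side, since the 3x3 Sobel kernel reads neighboring rows) is filtered with the same sequential kernel, and the master stitches the results. Function names and the band-splitting scheme below are our assumptions, not the paper's MPI code:

```python
def sobel_magnitude(img):
    """|Gx| + |Gy| Sobel response for interior pixels; borders stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out

def parallel_sobel(img, workers=2):
    """Simulate the master/slave split: each 'slave' filters a band of rows
    plus one ghost row per side, and the master stitches the results."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    size = -(-(h - 2) // workers)          # ceil(interior rows / workers)
    for lo in range(1, h - 1, size):
        hi = min(lo + size, h - 1)
        band = img[lo - 1:hi + 1]          # rows for this slave, plus ghosts
        part = sobel_magnitude(band)
        for y in range(lo, hi):
            out[y] = part[y - lo + 1]
    return out
```

In the MPI version, the master scatters the bands (with ghost rows) to the slaves and gathers the filtered bands back; the serial stitch above reproduces the full-image result exactly.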