Note: This page contains sample records for the topic parallel processor array from Science.gov.
While these samples are representative of the content of Science.gov,
they are not comprehensive nor are they the most current set.
We encourage you to perform a real-time search of Science.gov
to obtain the most current and comprehensive results.
Last update: November 12, 2013.
1

Sparse Matrix Computations on Parallel Processor Arrays  

Microsoft Academic Search

We investigate the balancing of distributed compressed storage of large sparse matrices on a massively parallel computer. For fast computation of matrix-vector and matrix-matrix products on a rectangular processor array with efficient communications along its rows and columns, we require that the nonzero elements of each matrix row or column be distributed among the processors located within the same array row

Andrew T. Ogielski; William Aiello

1992-01-01

2

Design Space Exploration for Massively Parallel Processor Arrays  

Microsoft Academic Search

In this paper, we describe an approach for the optimization of dedicated co-processors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each co-processor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computational

Frank Hannig; Jürgen Teich

2001-01-01

3

Titanic: a VLSI based content addressable parallel array processor  

SciTech Connect

A design is presented for a content addressable parallel array processor (CAPAP) which is both practical and feasible. Its practicality stems from an extensive program of research into real applications of content addressability and parallelism. The feasibility of the design stems from development under a set of conservative engineering constraints tied to limitations of VLSI technology. 1 ref.

Weems, C.; Levitan, S.; Foster, C.

1982-01-01

4

Integration of IR focal plane arrays with massively parallel processor  

NASA Astrophysics Data System (ADS)

The intent of this investigation is to replace the low fill factor visible sensor of a Cellular Neural Network (CNN) processor with an InGaAs Focal Plane Array (FPA) using both bump bonding and epitaxial layer transfer techniques for use in the Ballistic Missile Defense System (BMDS) interceptor seekers. The goal is to fabricate a massively parallel digital processor with a local as well as a global interconnect architecture. Currently, this unique CNN processor is capable of processing a target scene in excess of 10,000 frames per second with its visible sensor. What makes the CNN processor so unique is that each processing element includes memory, local data storage, local and global communication devices and a visible sensor supported by a programmable analog or digital computer program.

Esfandiari, P.; Koskey, P.; Vaccaro, K.; Buchwald, W.; Clark, F.; Krejca, B.; Rekeczky, C.; Zarandy, A.

2008-05-01

5

VLSI Array processors  

Microsoft Academic Search

High-speed signal processing depends critically on parallel processor technology. In most applications, general-purpose parallel computers cannot offer satisfactory real-time processing speed due to severe system overhead. Therefore, for real-time digital signal processing (DSP) systems, special-purpose array processors have become the only appealing alternative. In designing or using such array processors, most signal processing algorithms share the critical attributes of

S. Kung

1985-01-01

6

Design of a Bit-Serial Floating Point Unit for a Fine Grained Parallel Processor Array  

Microsoft Academic Search

This paper presents the design of a new bit-serial floating-point unit (FPU). It has been developed for the processors of the Instruction Systolic Array parallel computer model. In contrast to conventional bit-parallel FPUs, the bit-serial approach requires different data formats. Our FPU uses an IEEE-compliant internal floating-point format that allows a fast least significant bit (LSB)-first arithmetic and

Manfred Schimmler; Bertil Schmidt; Hans-Werner Lang

2003-01-01

7

Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array  

NASA Astrophysics Data System (ADS)

This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.

Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

2008-05-01
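
The work-farm pattern named in the record above (a parallel set of workers fed by one input stream and drained by one output stream) can be illustrated roughly as follows. This is only a minimal Python sketch under stated assumptions, not the MPPA structural object programming model: the names `work_farm` and `square` are hypothetical, and a thread pool stands in for the worker objects and self-synchronizing channels.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def work_farm(worker: Callable[[T], R],
              stream: Iterable[T],
              n_workers: int = 4) -> Iterator[R]:
    """Fan one input stream out to n_workers identical workers and
    merge their results into one output stream, in input order."""
    # Assumption: a thread pool approximates a statically configured,
    # homogeneous farm; real MPPA objects communicate over channels.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in the order the inputs arrived.
        yield from pool.map(worker, stream)


def square(x: int) -> int:
    """Hypothetical worker: any pure per-item function would do."""
    return x * x


results = list(work_farm(square, range(8)))
```

A heterogeneous farm would assign different worker functions to different items; the sketch covers only the homogeneous case.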

8

A four-processor building block for SIMD processor arrays  

Microsoft Academic Search

A four-processor chip, for use in processor arrays for image computations, is described. The large degree of data parallelism available in image computations allows dense array implementations where all processors operate under the control of a single instruction stream. An instruction decoder shared by the four processors on the chip minimizes the pin count allocated for global control of the

A. L. Fisher; P. T. Highnam; T. E. Rockoff

1990-01-01

9

VLSI array processor chip set  

SciTech Connect

The author describes the Honeywell array processor (HAP), a parallel pipelined 16-bit fixed point machine which is well balanced to optimize its throughput for signal processing algorithms such as FFT, convolution, and filtering. 1 reference.

Mylet, P.

1983-01-01

10

PARALLEL IMPLEMENTATION OF FINITE DIFFERENCE SCHEMES FOR THE PLATE EQUATION ON A FPGA-BASED MULTI-PROCESSOR ARRAY  

Microsoft Academic Search

The computational complexity of the finite difference (FD) schemes for the solution of the plate equation prevents them from being used in musical applications. The explicit FD schemes can be parallelized to run on multi-processor arrays for achieving real-time performance. Field Programmable Gate Arrays (FPGAs) provide an ideal platform for implementing these architectures with the advantages of low-

E. Motuk; R. Woods; S. Bilbao

11

Performance Bounds for Parallel Processors.  

National Technical Information Service (NTIS)

A general model of computation on a p-parallel processor is proposed, distinguishing clearly between the logical parallelism (p* processes) inherent in a computation, and the physical parallelism (p processor) available in the computer organization. This ...

R. B. L. Lee

1976-01-01

12

Array processors in chemistry  

SciTech Connect

The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

Ostlund, N.S.

1980-01-01

13

Languages for Parallel Processors  

NASA Astrophysics Data System (ADS)

The effective programming of parallel computers is much more complex than the programming of conventional serial computers. There are two fundamental models of highly parallel computer architectures: single instruction stream, multiple data stream (SIMD), in which a single program control unit is used to control a set of slave processing elements, and multiple instruction stream, multiple data stream (MIMD), in which a set of interconnected independent processors cooperate on a single task. The high-level programming language constructs appropriate for each model are discussed.

Reeves, A. P.

14

Array processor supercomputers  

SciTech Connect

Array processor supercomputers achieve their supercomputer performance by connecting massive numbers (64K) of relatively simple processors together. Commercially available machines have computation rates of up to 4 billion operations per second. These large computation rates have generated a substantial body of research investigating their usefulness for computationally intensive tasks such as image processing. These machines, however, also hold promise of efficient execution of non-numerical algorithms because of their ability to perform massive searching in constant time, often eliminating the need for ordering and complex data structures such as those using pointers. This paper describes the range of hardware variations of array processors, comparing and contrasting the significant differences among them, as well as briefly illustrating the wide range of algorithms that can effectively utilize them.

Potter, J.L.; Meilander, W.C. (Kent State Univ., OH (USA). Dept. of Mathematical Sciences)

1989-12-01

15

Delft Parallel Processor 84/16.  

National Technical Information Service (NTIS)

The development of the Delft Parallel Processor started in 1976. Since then the machine has grown to a processor with 16 independently operating, but tightly connected processing elements. The parallel processor is supervised by a host processor. In the f...

J. H. M. Andriessen

1986-01-01

16

Parallel data processor  

US Patent & Trademark Office Database

A parallel processor has a controller for generating control signals, and a plurality of identical processing cells, each of which is connected to at least one neighboring cell and responsive to the controller for processing data in accordance with the control signals. Each processing cell includes a memory, a first register, a second register, and an arithmetic logic unit (ALU). An input of the first register is coupled to a memory output. The output of the first register is coupled to a second register located in a neighboring cell. An input of the second register is coupled to receive an output from a first register located in a neighboring cell. The output of the second register is coupled to an input of the ALU. In another feature, mask logic is interposed between A and B operand sources, and two inputs of the ALU. The mask logic also inputs a mask source, and in response to control signals, can output the A operand logically OR'ed with the mask, and can output the B operand logically AND'ed with the mask. In another feature, each cell includes a multiplexor coupled to a neighboring cell for selectively transmitting cell data to the neighbor, or for effectively bypassing the cell during data shift operations by transmitting data that is received from a neighboring cell to a neighboring cell. Other enhancements to a cell architecture are also disclosed.

2000-06-06

17

Supporting dynamic parallel object arrays  

Microsoft Academic Search

We present efficient support for generalized arrays of parallel data driven objects. The “array elements” are scattered across a parallel machine. Each array element is an object that can be thought of as a virtual processor. The individual elements are addressed by their “index”, which can be an arbitrary object rather than a simple integer. For example, it can be

Orion Sky Lawlor; Laxmikant V. Kalé

2001-01-01

18

Skewless optical data-link subsystem for massively parallel processors using 8 Gb/s×1.1 Gb/s MMF array optical module  

Microsoft Academic Search

A high-capacitance, compact, low-cost, and convenient optical data-link subsystem for parallel computers is developed using an 8 Gb/s×1.1 Gb/s multimode fiber (MMF) array optical module and newly developed data-link IC. Although the subsystem uses an MMF ribbon, parallel data is successfully transmitted over 1 km due to ±15-ns deskew operation of the IC. The subsystem operated stably in a processor

Takashi Yoshikawa; Sohichiro Araki; Kazunori Miyoshi; Yoshihiko Suemura; Naoya Henmi; Takeshi Nagahori; Hiroshi Matsuoka; Takashi Yokota

1997-01-01

19

A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation  

Microsoft Academic Search

The future of high-performance computing is likely to rely on the ability to efficiently exploit huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate problems as

David Barrie Thomas; Lee Howes; Wayne Luk

2009-01-01

20

Fault-Tolerant Array Processors Using Single-Track Switches  

Microsoft Academic Search

An array processor is a collection of many similar processing elements (PEs), which can operate in both parallel and pipelined modes. For the implementation of arrays with large numbers of processors, fault tolerance has always been a very critical design issue. Very often, spare PEs and switching lattices are incorporated in the array to improve the (fabrication-time) yield and

Sun-Yuan Kung; Shiann-Ning Jean; Chih-Wei Jim Chang

1989-01-01

21

Molecular mechanics with an array processor  

NASA Astrophysics Data System (ADS)

In recent years molecular mechanics, the computer simulation of molecular systems using molecular dynamics, Monte Carlo, and energy minimization, has emerged as a powerful tool for investigating and understanding chemical properties and processes. In this paper we discuss a particular solution to the computational needs of molecular mechanics, the use of specialized hardware, a high speed array processor, in our case a Floating Point Systems, Inc. AP-120B. In other papers, we have also discussed an alternative solution, the division of the problem among an array of different processors operating in parallel. Yet a third solution is to use a vector processing machine such as a Cray-1. We will focus here primarily on molecular dynamics within the array processor program package for molecular mechanics we have developed, called Newton. In section II we examine the architecture of the AP-120B. In section III we lay out the specifics of the structure of the program package we have developed for molecular dynamics. Finally, in section IV we present the results, analyze potential array processor improvements, and point out the advantages and importance of the environment in terms of language and operating system.

Berens, P. H.; Wilson, K. R.

1982-06-01

22

Instruction scheduling for instruction level parallel processors  

Microsoft Academic Search

Nearly all personal computer and workstation processors, and virtually all high-performance embedded processor cores, now embody instruction level parallel (ILP) processing in the form of superscalar or very long instruction word (VLIW) architectures. ILP processors put much more of a burden on compilers; without

Paolo Faraboschi; Joseph A. Fisher; Cliff Young

2001-01-01

23

A four-processor building block for SIMD processor arrays  

Microsoft Academic Search

A four-processor chip, for use in processor arrays for image computations, is described. The full-custom 2-µm CMOS chip contains 56669 transistors and runs instructions at 10 MHz. 512 16-bit processors and external memory fit on two industry standard cards to yield 5-GIPS (billions of instructions/s) peak throughput

A. L. Fisher; P. T. Highnam; T. E. Rockoff

1989-01-01

24

Globality and speed of optical parallel processors.  

PubMed

The chances of optical computing are probably best if a large number of processing elements act in parallel. The efficiency of parallel processors depends, among other things, on the time it takes to communicate signals from one processor to any other processor. In an optical parallel processor one hopes to be able to transmit a signal from one processor to any other processor within only one cycle period, no matter how far apart the processors are. Such a global communications network is desirable especially for algorithms with global interactions. The fast Fourier algorithm is an example. We define a degree of globality and we show how speed and globality are related. Our result applies to a specific architecture based on spatial filtering. PMID:20555787

Lohmann, A W; Marathay, A S

1989-09-15

25

Bibliographic Pattern Matching Using the ICL Distributed Array Processor.  

ERIC Educational Resources Information Center

Describes the use of a highly parallel array processor for pattern matching operations in a bibliographic retrieval system. The discussion covers the hardware and software features of the processor, the pattern matching algorithm used, and the results of experimental tests of the system. (37 references) (Author/CLB)

Carroll, David M.; And Others

1988-01-01

26

Image processing using one-dimensional processor arrays  

Microsoft Academic Search

The first half of this paper presents the design rationale for CNAPS, a specialized one-dimensional (1-D) processor array developed by Adaptive Solutions Inc. In this context, we discuss the problem of Amdahl's law which severely constrains special-purpose architectures. We also discuss specific architectural decisions such as the kind of parallelism, the computational precision of the processors, on-chip versus off-chip processor

Dan W. Hammerstrom; Daniel P. Lulich

1996-01-01

27

Allocating Independent Subtasks on Parallel Processors  

Microsoft Academic Search

When using MIMD (multiple instruction, multiple data) parallel computers, one is often confronted with solving a task composed of many independent subtasks where it is necessary to synchronize the processors after all the subtasks have been completed. This paper studies how the subtasks should be allocated to the processors in order to minimize the expected time it takes to finish

Clyde P. Kruskal; Alan Weiss

1985-01-01

28

Scheduling Independent Tasks on Parallel Processors  

Microsoft Academic Search

This paper considers the problem of scheduling m independent, immediately available tasks on n parallel processors. Each task has a waiting cost rate, that is a function of time, and a service time. There are no feasibility restrictions on the order in which the tasks are to be processed. An optimal scheduling rule is presented for the single processor scheduling

Michael H. Rothkopf

1966-01-01

29

Parallel computing on Unix workstation arrays  

NASA Astrophysics Data System (ADS)

We have tested arrays of general-purpose Unix workstations used as MIMD systems for massive parallel computations. In particular we have solved numerically a demanding test problem with a 2D hydrodynamic code, generally developed to study astrophysical flows, by executing it on arrays either of DECstations 5000/200 on Ethernet LAN, or of DECstations 3000/400, equipped with powerful Alpha processors, on FDDI LAN. The code is appropriate for data-domain decomposition, and we have used a library for parallelization previously developed in our Institute, and easily extended to work on Unix workstation arrays by using the PVM software toolset. We have compared the parallel efficiencies obtained on arrays of several processors to those obtained on a dedicated MIMD parallel system, namely a Meiko Computing Surface (CS-1), equipped with Intel i860 processors. We discuss the feasibility of using non-dedicated parallel systems and conclude that the convenience depends essentially on the size of the computational domain as compared to the relative processor power and network bandwidth. We point out that for future perspectives a parallel development of processor and network technology is important, and that the software still offers great opportunities of improvement, especially in terms of latency times in the message-passing protocols. In conditions of significant gain in terms of speedup, such workstation arrays represent a cost-effective approach to massive parallel computations.

Reale, F.; Bocchino, F.; Sciortino, S.

1994-12-01

30

Embedding pyramids in array processors with pipelined busses  

Microsoft Academic Search

The concept of pipelined buses for parallel architectures diverges from the conventional exclusive access buses and offers both possibilities and challenges for significantly improving the efficiency of interprocessor communications in parallel computers. The authors present an efficient embedding of pyramids in array processors with pipelined buses. The embedding has the property that all the neighboring nodes in the pyramid are

Zicheng Guo; Rami G. Melhem

1990-01-01

31

Multiple-Fold Clustered Processor Mesh Array.  

National Technical Information Service (NTIS)

The multiple-fold clustered processor mesh array is a triangular organization of clustered processing elements. This multiple-fold array maintains functional equivalence to the nearest neighbor mesh computer with uni-directional interprocessor communicati...

G. G. Pechanek; S. Vassiliadis; J. G. Delgado

1993-01-01

32

alpha(sub critical) for Parallel Processors.  

National Technical Information Service (NTIS)

Alpha(sub critical) is defined as the fraction of a computational task that must be executed in parallel for the theoretical peak performance for two parallel processors to be equal. Values of alpha(sub critical) are given for a number of contemporary hig...

D. Moncrieff; R. E. Overill; S. Wilson

1994-01-01

33

Parallel Data Mining on Graphics Processors  

Microsoft Academic Search

We introduce GPUMiner, a novel parallel data mining system that utilizes new-generation graphics processing units (GPUs). Our system relies on the massively multi-threaded SIMD (Single Instruction, Multiple-Data) architecture provided by GPUs. As special-purpose co-processors, these processors are highly optimized for graphics rendering and rely on the CPU for data input/output as well as complex program control. Therefore,

Wenbin Fang; Ka Keung Lau; Mian Lu; Xiangye Xiao; Chi Kit Lam; Philip Yang Yang; Bingsheng He; Qiong Luo; Pedro V. Sander; Ke Yang

2008-01-01

34

Parallel algorithms with processor failures and delays. Technical report  

SciTech Connect

The authors study efficient deterministic parallel algorithms on two models: restartable fail-stop CRCW PRAMs and strongly asynchronous PRAMs. In the first model, synchronous processors are subject to arbitrary stop failures and restarts determined by an on-line adversary and involving loss of private but not shared memory; the complexity measures are completed work (where processors are charged for completed fixed-size update cycles) and overhead ratio (completed work amortized over necessary work and failure). In the second model, the result of the computation is a serialization of the actions of the processors determined by an on-line adversary; the complexity measure is total work (number of steps taken by all processors). Despite their differences, the two models share key algorithmic techniques. The authors present new algorithms for the Write-All problem (in which P processors write ones into an array of size N) for these two models. These algorithms can be used to implement a simulation strategy for any N-processor PRAM on a restartable fail-stop P-processor CRCW PRAM such that it guarantees a terminating execution of each simulated N-processor step, with O(log² N) overhead ratio.

Buss, J.F.; Kanellakis, P.C.; Ragde, P.L.; Shvartsman, A.A.

1991-08-01
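
To make the Write-All problem statement in the record above concrete (P processors write ones into an array of size N), here is a minimal Python sketch of the trivial failure-free baseline only. It is not one of the authors' fault-tolerant algorithms: the function name `write_all`, the cyclic assignment of cells to processors, and the sequential simulation of the processors are all assumptions made for illustration.

```python
from typing import List, Tuple


def write_all(n: int, p: int) -> Tuple[List[int], int]:
    """Simulate P failure-free processors writing ones into N cells.

    Processor pid is assumed to own cells pid, pid + P, pid + 2P, ...
    Returns the final array and the total work (number of writes).
    """
    if n < 0 or p < 1:
        raise ValueError("need n >= 0 and p >= 1")
    array = [0] * n
    work = 0
    for pid in range(p):               # processors simulated one by one
        for i in range(pid, n, p):     # cyclic share of this processor
            array[i] = 1
            work += 1                  # one unit of completed work
    return array, work


cells, total_work = write_all(10, 3)
```

With no failures the total work is exactly N. The difficulty the record addresses arises once an adversary can stop and restart processors, which this baseline does not model.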

35

ST-100 array processor architectural highlights  

SciTech Connect

The ST-100 array processor is a high-performance, cost-effective array processor. It attaches to general-purpose computer systems for use in signal processing, image processing, simulation, geophysical applications, and general scientific computing. The architectural features of the ST-100 result in a 100 megaflop array processor designed to fit into a multiple computer system environment. The ST-100 provides high performance at a low cost, ease of use, and system flexibility. The structure of the ST-100 will allow it to fit into unique system configurations as the application world continues to change.

Hausman, R.C.; Cannon, P.A. II

1983-01-01

36

An efficient SAR parallel processor  

Microsoft Academic Search

A parallel architecture especially designed for a synthetic-aperture-radar (SAR) processing algorithm based on an appropriate two-dimensional fast Fourier transform (FFT) code is presented. The algorithm is briefly summarized, and the FFT code is given for the one-dimensional case, although all results can be immediately generalized to the double FFT. The computer architecture, which consists of a toroidal net with transputers

Giorgio Franceschetti; Antonino Mazzeo; Nicola Mazzocca; V. Pascazio; Gilda Schirinzi

1991-01-01

37

Fault-Tolerant Processor Arrays Based on the 1½Track Switches with Flexible Spare Distributions  

Microsoft Academic Search

A mesh-connected processor array consists of many similar processing elements (PEs), which can operate in both parallel and pipelined modes. For the implementation of an array with a large number of processors, fault-tolerance techniques are necessary to enhance the (fabrication-time) yield and the (run-time) reliability. In this paper, we propose a fault-tolerant reconfigurable processor array using single-track switches like

Tadayoshi Horita; Itsuo Takanami

2000-01-01

38

Static and Dynamic Processor Scheduling Disciplines in Heterogeneous Parallel Architectures  

Microsoft Academic Search

Most parallel jobs cannot be fully parallelized. In a homogeneous parallel machine (one in which all processors are identical), the serial fraction of the computation has to be executed at the speed of any of the identical processors, limiting the speedup that can be obtained due to parallelism. In a heterogeneous architecture, the sequential bottleneck can be greatly reduced by running the

D. A. Menasce; D. Saha; S. C. D. Porto; V. A. F. Almeida; S. K. Tripathi

1995-01-01
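
The speedup limit that the record above attributes to the serial fraction on a homogeneous machine is the classic Amdahl's law bound, speedup = 1 / (s + (1 - s) / p). The following minimal Python sketch evaluates that formula; the function name `amdahl_speedup` is hypothetical, and the heterogeneous scheduling disciplines studied in the paper are not modeled here.

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Upper bound on speedup for p identical processors when a
    fraction s of the work must run serially (Amdahl's law)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # Serial part runs at single-processor speed; the rest is divided
    # evenly among the p identical processors.
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With a 10% serial fraction, even unlimited processors cannot push the
# speedup past 1 / 0.1 = 10.
bound_10 = amdahl_speedup(0.1, 10)
bound_large = amdahl_speedup(0.1, 1_000_000)
```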

39

Supporting dynamic parallel object arrays  

Microsoft Academic Search

We present efficient support for generalized arrays of parallel data driven objects. Array elements are regular C++ objects, and are scattered across the parallel machine. An individual element is addressed by its

Orion Sky Lawlor; Laxmikant V. Kalé

2003-01-01

40

Scalable Unix tools on parallel processors  

SciTech Connect

The introduction of parallel processors that run a separate copy of Unix on each process has introduced new problems in managing the user's environment. This paper discusses some generalizations of common Unix commands for managing files (e.g. ls) and processes (e.g. ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.

Gropp, W.; Lusk, E.

1994-12-31

41

Ray tracing on a networked processor array  

Microsoft Academic Search

As computation costs increase to meet design requirements for computation-intensive graphics applications on today's embedded systems, the pressure to develop high-performance parallel processors on a chip will increase. Acceleration of the ray tracing computation has become a major issue as the computer graphics industry demands the rendering of realistic images. Network-on-chip (NoC) techniques that interconnect multiple processing elements with routers are

Jungsook Yang; Seung Eun Lee; Chunyi Chen; Nader Bagherzadeh

2010-01-01

42

Design of a fault-tolerant parallel processor  

Microsoft Academic Search

The Charles Stark Draper Laboratory, under contract to the NASA Johnson Space Center, has developed a Fault-Tolerant Parallel Processor (FTPP) for use on the NASA X-38 experimental vehicle. Using commercial processor boards and the industry-standard VME backplane, the system is configured as a quadruplet Flight-Critical Processor (FCP) and five simplex Instrumentation Control Processors (ICPs). The FCP is Byzantine resilient for

Roger Racine; Michael LeBlanc; Samuel Beilin

2002-01-01

43

A systolic array parallelizing compiler  

SciTech Connect

This book presents a completely new approach to the problem of building a parallelizing compiler for systolic arrays. It describes the AL parallelizing compiler for the Warp systolic array, the first working systolic array parallelizing compiler which can generate efficient parallel code for complete LINPACK routines. The book begins by analyzing the architectural strength of the Warp systolic array. It proposes a model for mapping programs onto the machine and introduces the notion of data relations for optimizing the program mapping. Also presented are successful applications of the AL compiler in matrix computation and image processing. A complete listing of the source program and compiler-generated parallel code are given to clarify the overall picture of the compiler. The book concludes that a systolic array parallelizing compiler can produce efficient parallel code, almost identical to what the user would have written by hand.

Tseng, P.S. (Bell Communications Research, Inc. (US))

1990-01-01

44

The K2 Parallel Processor: Architecture and Hardware Implementation  

Microsoft Academic Search

K2 is a distributed-memory parallel processor designed to support a multiuser, multitasking, time-sharing operating system and an automatically parallelizing Fortran compiler. The architecture and the hardware implementation of K2 are presented. The authors focus on the architectural features required by the operating system and the compiler. A prototype machine with 24 processors is currently being developed

Marco Annaratone; Marco Fillo; Kiyoshi Nakabayashi; Marc A. Viredaz

1990-01-01

45

Low-complexity distributed parallel processor for 2D IIR broadband beam plane-wave filters  

Microsoft Academic Search

Real-time systolic-array-based implementations of VLSI two-dimensional (2D) infinite-impulse-response (IIR) frequency-planar beam plane-wave filters have potentially wide applications in the filtering of spatio-temporal RF broadband plane waves based on their directions of arrival (DOAs). Distributed-parallel-processor (DPP) implementations of the systolic arrays allow synchronous sampling of the 2D input signal array, but because of the direct-form structure they have high circuit complexity.

H. L. P. A. Madanayake; Len Bruton

2007-01-01

46

A mixed-signal array processor with early vision applications  

Microsoft Academic Search

Many early vision tasks require only 6 to 8 b of precision. For these applications, a special-purpose analog circuit is often a smaller, faster, and lower power solution than a general-purpose digital processor, but the analog chips lack the programmability of digital image processors. This paper presents a programmable mixed-signal array processor which combines the programmability of a digital processor

David A. Martin; Hae-Seung Lee; Ichiro Masaki

1998-01-01

47

Massively parallel MRI detector arrays.  

PubMed

Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas via reception, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called "ultimate" SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

Keil, Boris; Wald, Lawrence L

2013-02-07

48

Massively parallel MRI detector arrays  

NASA Astrophysics Data System (ADS)

Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas via reception, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called "ultimate" SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays.

Keil, Boris; Wald, Lawrence L.

2013-04-01

49

A transformative approach to the partitioning of processor arrays  

Microsoft Academic Search

The paper describes the systematic design of processor arrays with a given dimension and a given number of processing elements. The unified approach to the solution of this problem called partitioning is based on the following concepts: (1) Algorithms and processor arrays are represented by (piecewise regular) programs. (2) The concept of stepwise refinement of programs is used to solve

Jürgen Teich; Lothar Thiele

1992-01-01

50

Scalable parallel suffix array construction  

Microsoft Academic Search

Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. We describe the first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction. The implementation works on distributed memory computers using MPI. Experiments with up to

Fabian Kulla; Peter Sanders

2007-01-01
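
As an illustration of the data structure named in the record above, here is a minimal sequential sketch in Python. It is not the parallel MPI algorithm the paper describes; it uses a naive sort of all suffixes, and the function names `suffix_array` and `find_occurrences` are hypothetical names chosen for this example.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction: sorts the suffixes directly, which is fine for
    short strings but far slower than scalable construction algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Use two binary searches over the suffix array to locate `pattern`."""
    m = len(pattern)

    # Leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix after `start` whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


sa = suffix_array("banana")
print(sa)                                     # suffix start positions
print(find_occurrences("banana", sa, "ana"))  # positions where "ana" occurs
```

Once the array is built, each lookup needs only a logarithmic number of string comparisons, which is why the structure suits full-text indexing.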

51

Wavefront simulator for evaluating RF communication array signal processors  

NASA Astrophysics Data System (ADS)

A wavefront simulator that emulates plane wave propagation from multiple transmitting antennas is used to evaluate, in both a static and a dynamic manner, an array processor used in RF communications to determine the location of transmitting antennas and possibly to perform beamforming for cancelling the energy of an interfering transmitter. The wavefront simulator generates time delay signals, giving the appearance of being emitted from different transmitters or sources, and simulates those signals as being received by an antenna array associated with the array processor. The array processor utilizes the time delay signals to calculate, e.g., the angle of arrival of the signals from the emitting antennas.

Minarik, Steven B.

1993-10-01
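
The relation between time delay and angle of arrival mentioned in the record above can be sketched in Python for the simplest case. This assumes a uniform linear array, a far-field plane wave, and an angle measured from broadside; none of these details come from the record, and the function names are hypothetical.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s


def element_delays(num_elements: int, spacing_m: float, angle_rad: float) -> list[float]:
    """Delay of a plane wave at each element relative to element 0.

    Assumes a uniform linear array and an arrival angle measured from
    broadside (the direction perpendicular to the array axis).
    """
    return [k * spacing_m * math.sin(angle_rad) / C for k in range(num_elements)]


def angle_from_delay(delay_s: float, spacing_m: float) -> float:
    """Invert the delay between two adjacent elements to an arrival angle."""
    return math.asin(C * delay_s / spacing_m)


delays = element_delays(num_elements=4, spacing_m=0.5, angle_rad=0.3)
print(angle_from_delay(delays[1], spacing_m=0.5))  # recovers roughly 0.3 rad
```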

52

Pringle: A Parallel Processor to Emulate CHiP (Configurable Highly Parallel) Computers.  

National Technical Information Service (NTIS)

The Pringle is a 64 processor parallel computer designed to serve as a laboratory instrument for studying Configurable, Highly Parallel (CHiP) Computers. The Pringle's design objectives, architecture and physical characteristics are presented. A key compo...

A. A. Kapauan; J. T. Field; L. Snyder

1983-01-01

53

Chromosome image segmentation on PAL parallel image processor  

Microsoft Academic Search

Chromosome image segmentation is an important step toward automatic karyotyping that involves visualization and interpretation of chromosomes. In this paper, we analyze the characteristics of chromosome images that can be effectively used for segmenting chromosomes and can be efficiently extracted on the Lockheed-Martin PAL parallel image processor. We design and implement a parallel algorithm that uses local features to split

Hongchi Shi; Paul D. Gader; Hongzheng Li

1997-01-01

54

Chromosome image segmentation on PAL parallel image processor  

NASA Astrophysics Data System (ADS)

Chromosome image segmentation is an important step toward automatic karyotyping that involves visualization and interpretation of chromosomes. In this paper, we analyze the characteristics of chromosome images that can be effectively used for segmenting chromosomes and can be efficiently extracted on the Lockheed-Martin PAL parallel image processor. We design and implement a parallel algorithm that uses local features to split touching chromosomes.

Shi, Hongchi; Gader, Paul D.; Li, Hongzheng

1997-09-01

55

CMOS processor element for a fault-tolerant SVD array  

NASA Astrophysics Data System (ADS)

This paper describes the VLSI implementation of a CORDIC based processor element for use in a fault-reconfigurable systolic array to compute the singular value decomposition (SVD) of a matrix. The chip implements a time redundant fault tolerance scheme, which allows processors adjacent to a faulty processor to act as computation backup during the systolic idle time. Also, processors around a fault collaborate to reroute data around the faulty processor. This form of time redundancy is attractive when tolerance to a few faults needs to be achieved with little hardware overhead.

Kota, Kishore; Cavallaro, Joseph R.

1993-11-01
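
For readers unfamiliar with CORDIC, the following Python sketch shows the rotation-mode iteration in floating point. It is only an illustration of the general technique under stated assumptions: the chip in the record above is a hardware design whose word lengths and SVD-specific details are not given here, and `cordic_rotate` is a hypothetical name.

```python
import math


def cordic_rotate(x: float, y: float, angle: float, iterations: int = 32):
    """Rotate the vector (x, y) by `angle` radians using CORDIC micro-rotations.

    Each step rotates by +/- atan(2**-i), which in hardware needs only
    shifts and adds. Converges for |angle| up to about 1.74 rad.
    """
    # The micro-rotations stretch the vector by a known constant gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)

    # Compensate for the accumulated gain.
    return x / gain, y / gain


c, s = cordic_rotate(1.0, 0.0, math.pi / 6)
print(c, s)  # approximately cos(30 degrees), sin(30 degrees)
```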

56

A 5.9mW 6.5GMACS CID/DRAM array processor  

Microsoft Academic Search

The pattern recognition processor performs digital vector matrix multiplication using internally analog fine-grain parallel computing. The three-transistor CID/DRAM unit cell combines single-bit dynamic storage, binary multiplication, and zero-latency analog accumulation. Delta-sigma analog-to-digital conversion of the analog array outputs is combined with oversampled unary coding of the digital inputs. The 256 × 128 CID/DRAM processor with integrated 128 delta-sigma ADCs measures

Roman Genov; Gert Cauwenberghs; Grant Mulliken; Farhan Adil

2002-01-01

57

New multilevel parallelism management for multimedia processors  

NASA Astrophysics Data System (ADS)

This paper presents a new parallelism manager for multimedia multiprocessors. An analysis of recent multimedia applications shows that the available parallelism moves from the data-level to the control-level. New architectures are required to be able to extract this kind of dynamic parallelism. Our proposed parallelism management describes the parallelism with a topological description of the task dependence graph. It allows various complex parallelism patterns to be represented. This parallelism description is separated from the program code to allow the task manager to decode it in parallel with the task execution. The task manager is based on a queue bank that stores the task graph. Control commands are inserted in the task dependence graph to allow a dynamic modification of this graph, depending on the processed data. Simulations on classical multiprocessing benchmarks show that, in the case of simple parallelism, performance is similar to that of classical systems. However, performance on complex applications is improved by up to 12%. Multimedia applications have also been simulated. The results show that our task manager can efficiently handle complex dynamic parallelism structures.

Verians, Xavier; Legat, Jean-Didier; Macq, Benoit M.; Quisquater, Jean-Jacques

1998-12-01

58

Portable QCD codes for massively parallel processors  

NASA Astrophysics Data System (ADS)

We present a new set of QCD codes in both message passing and data parallel versions. The message passing package used is PARMACS, although other packages may be used. Data parallel software is written in High Performance Fortran, an emerging standard based on Fortran 90. Software engineering methods have been applied to a physics application to create thoroughly tested and documented codes for the next generation of massively parallel supercomputers. Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, Scotland, UK

1994-04-01

59

DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors  

Microsoft Academic Search

We present a low-complexity heuristic, named the dominant sequence clustering algorithm (DSC), for scheduling parallel tasks on an unbounded number of completely connected processors. The performance of DSC is, on average, comparable to, or even better than, other higher-complexity algorithms. We assume no task duplication and nonzero communication overhead between processors. Finding the optimum solution for arbitrary directed acyclic task graphs (DAGs) is NP-complete. DSC

Tao Yang; Apostolos Gerasoulis

1994-01-01

60

Parallelizing the Volcano database query processor  

Microsoft Academic Search

Volcano is a new data flow query processing system developed for database systems research and education. All operators are designed and coded as if they were meant for a single-process system only. The design and implementation of Volcano's exchange operator, which parallelizes all other operators, is described. It allows intraoperator parallelism on partitioned data sets and both vertical and horizontal interoperator

Goetz Graefe

1990-01-01

61

Memory based processor array for artificial neural networks  

Microsoft Academic Search

In this paper an effective memory-processor integrated architecture, called memory based processor array for artificial neural networks (MPAA), is proposed. The MPAA can be easily integrated into any host system via memory interface. Specifically, the MPAA system provides an efficient mechanism for its local memory accesses allowed by the row basis and the column basis using the hybrid row and

Youngsik Kim; Mi-Jung Noh; Tack-Don Han; Shin-Dug Kim; Sung-Bong Yang

1997-01-01

62

Global synchronization of parallel processors using clock pulse width modulation  

DOEpatents

A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and perform a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.

Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

2013-04-02

63

Experience with a multiprocessor based on eight FPS 120B array processors  

SciTech Connect

The rate of increase in the speed of monoprocessors is no longer keeping pace with the needs of the laboratory; accordingly, the use of parallel processors in large scientific computations is being investigated. As an initial experiment, a particle-in-cell plasma simulation was adapted to run on a star graph architecture consisting of a UNIVAC 1110 as hub, and up to eight Floating Point Systems AP120B array processors at the other vertices. Subdivision of tasks among processors and measured results are discussed.

Bucher, I.Y.; Frederickson, P.O.; Moore, J.W.

1981-01-01

64

Array processors with pipelined optical busses  

Microsoft Academic Search

A synchronous multiprocessor architecture based on pipelined optical bus interconnections is presented. The processors are placed in a square grid and are interconnected to one another through horizontal and vertical optical buses. This architecture has an effective diameter as small as two owing to its orthogonal bus connections, and it allows all processors to have simultaneous access to the buses

Zicheng Guo; Rami G. Melhem; Richard W. Hall; Donald M. Chiarulli; Steven P. Levitan

1990-01-01

65

The RAP: a ring array processor for layered network calculations  

Microsoft Academic Search

The authors have designed and implemented a ring array processor, RAP, for fast implementation of layered neural network algorithms. The RAP is a multi-DSP system targeted at continuous speech recognition using connectionist algorithms. Four boards, each with four Texas Instruments TMS320C30 DSPs, serve as an array processor for a 68020-based host running a real-time operating system. The overall system

N. Morgan; J. Beck; P. Kohn; J. Bilmes; E. Allman; J. Beer

1990-01-01

66

An Evaluation of Document Retrieval from Serial Files Using the ICL Distributed Array Processor.  

ERIC Educational Resources Information Center

Describes preliminary investigation of the use of International Computers Limited's Distributed Array Processor (DAP) for parallel searching of large serial files of documents. DAP hardware and software, test collections, measurement of DAP performance, search algorithms, experimental results, and DAP suitability for interactive searching are…

Pogue, Christine; Willett, Peter

1984-01-01

67

3081/E emulator, a processor for use in on-line and off-line arrays  

SciTech Connect

This paper presents a status report on the 3081/E covering the processor hardware, interfacing capability, and accompanying software. Details of production figures and preliminary performance results are given. Plans for the use of arrays of 3081/Es for parallel event processing in both on-line and off-line systems are outlined.

Ferran, P.M.; Fucci, A.; Gallno, P.; Hinton, R.; Jacobs, D.; Kudla, M.; Martin, B.; Masuch, H.; Storr, K.M.; Gravina, M.

1985-08-01

68

High Speed Systolic Array Processor (HiSSAP) System Development Synopsis: Lesson Learned.  

National Technical Information Service (NTIS)

This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic an...

J. P. Loughlin

1991-01-01

69

Massively parallel processor networks with optical express channels  

Microsoft Academic Search

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a

Robert J. Deri; Eugene D. Brooks; Ronald E. Haigh; Anthony J. DeGroot

1999-01-01

70

Fast multi-image interaction and hierarchical processing-a new video array processor (VAP-80)  

SciTech Connect

An array processor structure is described which allows realization of a large set of local transforms, including nonlinear ones, at the TV scan rate of the Quantimet system. This is achieved with a combination of parallel and pipeline structures which uses look-up tables to perform all calculations. Some small hardware add-ons extend the applications to multi-image processing through a special organisation of the associated image memory that permits storage of several independent subimages. These may be processed jointly in the array processor. Some results on hierarchical labelling procedures of biological electron micrographs are presented as examples and compared to straightforward segmentation. 6 references.

Keller, Hj.; Comazzi, A.; Favre, A.

1982-01-01

71

Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic Path Problem  

NASA Astrophysics Data System (ADS)

The algebraic path problem (APP) is a general framework which unifies several solution procedures for a number of well-known matrix and graph problems. In this paper, we present a new 3-dimensional (3-D) orbital algebraic path algorithm and corresponding 2-D toroidal array processors which solve the n × n APP in the theoretically minimal number of 3n time-steps. The coordinated time-space scheduling of the computing and data movement in this 3-D algorithm is based on the modular function which preserves the main technological advantages of systolic processing: simplicity, regularity, locality of communications, pipelining, etc. Our design of the 2-D systolic array processors is based on a classical 3-D→2-D space transformation. We have also shown how a data manipulation (copying and alignment) can be effectively implemented in these array processors in a massively-parallel fashion by using a matrix-matrix multiply-add operation.

Sedukhin, Stanislav G.; Miyazaki, Toshiaki; Kuroda, Kenichi
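
To make the "general framework" in the record above concrete, here is a minimal sequential sketch of the algebraic path problem in Python. It is not the orbital systolic algorithm of the paper: it is the plain triple loop taking on the order of n³ steps on one processor, and it assumes a semiring in which the closure of every element equals the multiplicative identity (true for min-plus with non-negative weights and for the Boolean semiring). The name `algebraic_path_closure` and its parameters are hypothetical.

```python
import math


def algebraic_path_closure(matrix, add, mul, one):
    """Floyd-Warshall-style closure of a square matrix over a semiring.

    `add` and `mul` are the semiring operations and `one` is the identity
    of `mul`. Assumption: the closure of every scalar equals `one`, so no
    explicit star operation on the pivot element is needed.
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                a[i][j] = add(a[i][j], mul(a[i][k], a[k][j]))
    # Add the identity matrix so that empty paths are included.
    for i in range(n):
        a[i][i] = add(a[i][i], one)
    return a


INF = math.inf

# Shortest paths: (min, +) semiring, identity of + is 0.
weights = [[INF, 3, 8],
           [INF, INF, 2],
           [1, INF, INF]]
print(algebraic_path_closure(weights, min, lambda x, y: x + y, 0))

# Reflexive transitive closure: (or, and) semiring, identity of and is True.
adjacency = [[False, True, False],
             [False, False, True],
             [False, False, False]]
print(algebraic_path_closure(adjacency,
                             lambda x, y: x or y,
                             lambda x, y: x and y,
                             True))
```

The same loop nest solves both problems; only the pair of operations changes, which is the unification the abstract refers to.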

72

A high performance distributed-parallel-processor architecture for 3D IIR digital filters  

Microsoft Academic Search

Real-time spatio-temporal VLSI 3D IIR digital filters may be used for imaging or beamforming applications employing 3D input signals from synchronously-sampled multi-sensor arrays. Such filters have high computational complexity and often require arithmetic throughputs of hundreds of millions of floating point operations per second, especially in the case of potential radio frequency beamforming applications. A novel high-throughput distributed parallel processor

Arjuna Madanayake; Leonard T. Bruton

2005-01-01

73

Automatic generation of synchronization instructions for parallel processors  

SciTech Connect

The development of high speed parallel multi-processors, capable of parallel execution of doacross and forall loops, has stimulated the development of compilers to transform serial FORTRAN programs to parallel forms. One of the duties of such a compiler must be to place synchronization instructions in the parallel version of the program to insure the legal execution order of doacross and forall loops. This thesis gives strategies usable by a compiler to generate these synchronization instructions. It presents algorithms for reducing the parallelism in FORTRAN programs to match a target architecture, recovering some of the parallelism so discarded, and reducing the number of synchronization instructions that must be added to a FORTRAN program, as well as basic strategies for placing synchronization instructions. These algorithms are developed for two synchronization instruction sets. 20 refs., 56 figs.

Midkiff, S.P.

1986-05-01

74

Mapping Radiosity Computations to Parallel Processors.  

NASA Astrophysics Data System (ADS)

The radiosity method for rendering scenes is gaining popularity because of its ability to accurately model the energy distribution in an environment. As this photonic energy distribution is independent of the viewer's position, generating scenes for different viewpoints only requires hidden surface removal and can be performed in real-time. This makes it more attractive than ray tracing as a technique for modeling illumination. It is quite conceivable that the radiosity method will be used for applications in scientific visualization, lighting simulations, CAD/CAM, virtual reality, and medical imaging. Computing the radiosity of a scene with moderate to high complexity is tantamount to solving a system of tens of thousands of linear equations. Iterative linear system solvers, such as Gauss-Seidel, Jacobi, or conjugate descent, are quite demanding for a system of equations this large. An alternate approach, known as progressive refinement, offers some computational tractability and delivers an approximate solution relatively quickly. This dissertation presents the results of partitioning the radiosity computation to suitably map on a variety of multiprocessor classes. The effect of problem decomposition on computation and communication components is studied for the shared memory, the message passing and the loosely coupled distributed memory multiprocessors. Kendall Square Research's KSR1 and Intel hypercube iPSC/860 were used for experimenting with the shared memory and message-passing algorithms respectively. A network of IBM RS/6000 workstations was used for understanding coarse grain parallelization techniques. These experiments demonstrated that optimality of parallel algorithms must be considered as a <machine, algorithm> pair. Thus the notion of program portability must also take machine architecture into consideration besides allowing for software compatibility. As the number of polygons for processing complex scenes continues to grow, the subdivision in the object space becomes increasingly important. An adaptive technique for binary subdivision of the object space is outlined and used in all the experiments. The resulting tree has a better balance as compared to the conventional techniques. A multiprocessor architecture that utilizes the object space subdivision and uses the token driven dataflow computation model is proposed as a hardware solution for radiosity. The proposed architecture is targeted toward the high end workstations which can benefit from the proposed design in performing radiosity computation and other similar tasks.

Singh, Gautam Bir

75

Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids  

SciTech Connect

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

2011-11-08

76

Performance Studies of a Parallel Processor for Large Linear Programming Problems.  

National Technical Information Service (NTIS)

In order to speed up the solution of large linear programming (LP) problems a new parallel processor will be developed. For this processor a parallel algorithm is established and investigated. Several pivot searching strategies are described and compared ...

J. Luo

1986-01-01

77

Real-time simulation of MHD/steam power plants by digital parallel processors  

NASA Astrophysics Data System (ADS)

Attention is given to a large FORTRAN coded program which simulates the dynamic response of the MHD/steam plant on either a SEL 32/55 or VAX 11/780 computer. The code realizes a detailed first-principle model of the plant. Quite recently, in addition to the VAX 11/780, an AD-10 has been installed for usage as a real-time simulation facility. The parallel processor AD-10 is capable of simulating the MHD/steam plant at several times real-time rates. This is desirable in order to develop rapidly a large data base of varied plant operating conditions. The combined-cycle MHD/steam plant model is discussed, taking into account a number of disadvantages. The disadvantages can be overcome with the aid of an array processor used as an adjunct to the unit processor. The conversion of some computations for real-time simulation is considered.

Johnson, R. M.; Rudberg, D. A.

78

Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor  

Microsoft Academic Search

For three years, members of the Computer Science Department at the University of Rochester have used a collection of BBN Butterfly™ Parallel Processors to conduct research in parallel systems and applications. For most of that time, Rochester's 128-node machine has had the distinction of being the largest shared-memory multiprocessor in the world. In the course of our

Thomas J. LeBlanc; Michael L. Scott; Christopher M. Brown

1988-01-01

79

Torus-ring-bus connected highly parallel processor  

Microsoft Academic Search

A cluster structure interconnection network for massively parallel computers is proposed. Each processor is connected to a one-dimensional ring-bus network in a cluster, and each cluster is connected to a two-dimensional torus network. Basic network performance is estimated by software simulations for two communication patterns, namely, nearest neighbor communication and general communication. The architecture of a prototype machine that is

M. Kohata; T. Itoh; Y. Miyagaki

1993-01-01
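
The wraparound connectivity of a two-dimensional torus, as used between clusters in the record above, can be sketched in Python as follows. This covers only the torus level; the ring-bus inside each cluster and the prototype's actual routing are not modelled, and the function names are hypothetical.

```python
def torus_neighbors(row: int, col: int, rows: int, cols: int):
    """Return the four neighbours of a node in a rows x cols 2-D torus.

    The modulo operation implements the wraparound links that distinguish
    a torus from a plain mesh.
    """
    return [
        ((row - 1) % rows, col),  # north
        ((row + 1) % rows, col),  # south
        (row, (col - 1) % cols),  # west
        (row, (col + 1) % cols),  # east
    ]


def torus_hops(a, b, rows: int, cols: int) -> int:
    """Minimum number of torus links between nodes a and b."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # In each dimension, travel the shorter way around the ring.
    return min(dr, rows - dr) + min(dc, cols - dc)


print(torus_neighbors(0, 0, 4, 4))
print(torus_hops((0, 0), (3, 3), 4, 4))
```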

80

Error analysis of high data rate, optical parallel processors.  

PubMed

Optical parallel processors have the potential for aiding the transfer of information over networks. The systems implications for a baseline architecture employing spatial light modulators, lenses, and charge-coupled devices are examined. Specifically, because many applications have stringent requirements on errors, this study concentrates on categorizing the potential error sources, both random and systematic, and presents the results of an error analysis for a pixel-to-pixel mapping system as a notional example. PMID:18357233

Jackson, D J; Juncosa, M L

2001-05-10

81

Efficient parallel algorithms on restartable fail-stop processors  

SciTech Connect

The authors study efficient deterministic executions of parallel algorithms on restartable fail-stop CRCW PRAMs. They allow the PRAM processors to be subject to arbitrary stop failures and restarts that are determined by an on-line adversary and that result in loss of private memory but do not affect shared memory. For this model, they define and justify two complexity measures: completed work, where processors are charged for completed fixed-size update cycles, and overhead ratio, which amortizes the work over necessary work and failures. They observe that for P = N restartable fail-stop processors, the Write-All problem requires Ω(N log N) completed work, and this lower bound holds even under the additional assumption that processors can read and locally process the entire shared memory at unit cost. Under this unrealistic assumption they have a matching upper bound. The lower bound also applies to the expected completed work of randomized algorithms that are subject to on-line adversaries. Finally, they describe a simple on-line adversary that causes inefficiency in many randomized algorithms.

Kanellakis, P.C.; Shvartsman, A.A.

1991-01-01

82

Programmable inner-product enhanced associative processor array  

NASA Astrophysics Data System (ADS)

Associative Processors have become popular because of their ability to perform parallel operations in massive scale. The use of Associative Processors especially for MPEG4/H.263 video coding was found to have low power consumption. However they lack the ability to perform computationally intensive block transforms. The paper discusses requirements for video processing and shows how Associative Processors are more suited for video coding than RISC architectures. We highlight the various drawbacks of using Associative Processors for video coding and propose a new Distributed Arithmetic based enhancement to the architecture that provides greater flexibility in the implementation of video coding algorithms. These modifications help in faster computation of DCT and simulations of the proposed enhancement show that MPEG 4 simple profile encoder can be implemented in less than 10 MIPS.

Balam, Subhash C.; Hariharakrishnan, Karthik; Schonfeld, Dan

2004-04-01

83

Performance Improvements in Array Processors from Fast (On-Chip) Memory.  

National Technical Information Service (NTIS)

The report presents ideas, based on experimental results, concerning possible improvements to the architecture of the Distributed Array Processor. The first section describes what the current limitations of the array processors are and what results can be...

C. Lambrinoudakis

1988-01-01

84

Parallel processor system specific for Monte Carlo analysis based on ring bus architecture  

Microsoft Academic Search

We have developed a new parallel processor system specific for the MC analysis, to dramatically reduce the calculation time. Our parallel processor system is based on ring bus architecture. The RISC microprocessor chip, which contains a ring bus interface unit (RBIU), a floating point arithmetic unit (FAU) and so on, was also developed for our system. Speed up ratio

H. Kurino; T. Ono; N. Kuroishi; T. Kawata; N. Miyakawa; M. Fukase; R. Aibara; M. Koyanagi

1998-01-01

85

Language Parallel Pascal and other aspects of the massively parallel processor  

SciTech Connect

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

Reeves, A.P.; Bruner, J.D.

1982-01-01

86

Sn transport calculations on vector and parallel processors  

SciTech Connect

The transport of radiation from the source to the location of people or equipment gives rise to some of the most challenging of calculations. A problem may involve as many as a billion unknowns, each evaluated several times to resolve interdependence. Such calculations run many hours on a Cray computer, and a typical study involves many such calculations. This paper will discuss the steps taken to vectorize the DOT code, which solves transport problems in two space dimensions (2-D); the extension of this code to 3-D; and the plans for extension to parallel processors.

Rhoades, W.A.; Childs, R.L.

1987-01-01

87

Frequency-multiplexed and pipelined iterative optical systolic array processors.  

PubMed

Optical matrix processors using acoustooptic transducers are described with emphasis on new systolic array architectures using frequency multiplexing in addition to space and time multiplexing. A Kalman filtering application is considered as our case study from which the operations required on such a system can be defined. This also serves as a new and powerful application for iterative optical processors. The importance of pipelining the data flow and the ordering of the operations performed in a specific application of such a system are also noted. Several examples of how to effectively achieve this are included. A new technique for handling bipolar data on such architectures is also described. PMID:18195755

Casasent, D; Jackson, J; Neuman, C P

1983-01-01

88

Mapping iterative algorithms onto processor arrays by the use of Petri Net models  

Microsoft Academic Search

In this paper, Petri Nets (PNs) are used for deriving efficient mapping transformations of a wide class of algorithms to processor arrays. In the proposed methodology, given an algorithm and the interconnections of the processor array, two PNs are constructed: one that is related to the algorithm and one that is related to the processor array. The former PN models

K. E. Karagianni; E. D. Kyriakis-Bitzaros; T. Stouraitis

1994-01-01

89

Latin Squares for Parallel Array Access  

Microsoft Academic Search

A parallel memory system for efficient parallel array access using perfect latin squares as skewing functions is discussed. Simple construction methods for building perfect latin squares are presented. The resulting skewing scheme provides conflict free access to several important subsets of an array. The address generation can be performed in constant time with simple circuitry. The skewing scheme can provide constant time access to rows,

Kichul Kim; Viktor K. Prasanna

1993-01-01
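
The idea of a skewing function can be illustrated with a minimal sketch. It does not reproduce the perfect Latin square construction of the paper; it assumes the simplest cyclic Latin square, where element (i, j) is stored in memory module (i + j) mod n. The names `module_of` and `is_conflict_free` are hypothetical. The sketch shows that rows and columns are conflict free under this scheme while the main diagonal is not, which is the kind of gap a perfect Latin square is meant to close.

```python
def module_of(i, j, n):
    """Memory module holding array element (i, j) under cyclic skewing (assumed scheme)."""
    return (i + j) % n


def is_conflict_free(cells, n):
    """True if every cell in the access pattern maps to a different module."""
    modules = [module_of(i, j, n) for i, j in cells]
    return len(set(modules)) == len(modules)


n = 4
row = [(2, j) for j in range(n)]        # one full row
column = [(i, 1) for i in range(n)]     # one full column
diagonal = [(i, i) for i in range(n)]   # main diagonal

print(is_conflict_free(row, n))       # rows hit n distinct modules
print(is_conflict_free(column, n))    # columns hit n distinct modules
print(is_conflict_free(diagonal, n))  # diagonal collides when n is even
```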

90

Perfect Latin squares and parallel array access  

Microsoft Academic Search

A new nonlinear skewing scheme is proposed for parallel array access. We introduce a new Latin square (perfect Latin square) which has several properties useful for parallel array access. A sufficient condition for the existence of perfect Latin squares and a simple construction method for perfect Latin squares are presented. The resulting skewing scheme provides conflict-free access to various subsets

Kichul Kim; V. K. Prasanna-Kumar

1989-01-01

91

Microphone Array PostProcessor Using Instantaneous Direction of Arrival  

Microsoft Academic Search

In this paper we describe a novel algorithm for postprocessing a microphone array's beamformer output to achieve better spatial filtering under noise and reverberation. For each audio frame and frequency bin the algorithm estimates the spatial probability for sound source presence and applies a spatio-temporal filter towards the look-up direction. It is implemented as a real-time post-processor after a time-invariant

Ivan Tashev; Alex Acero

2006-01-01

92

OPALS - Optical parallel array logic system  

Microsoft Academic Search

A new optical-digital computing system called OPALS (optical parallel array logic system) is presented. OPALS can execute various parallel neighborhood operations such as cellular logic as well as parallel logical operations for two-dimensional sampled objects. The system has the ability to perform iterative operations. OPALS is systemized, centering on the optical logic method using image coding and optical correlation techniques.

Jun Tanida; Yoshiki Ichioka

1986-01-01

93

An informal introduction to program transformation and parallel processors  

SciTech Connect

In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)]

1994-08-01

94

Acousto-optic signal processors for transmission and reception of phased-array antenna signals.  

PubMed

Novel acousto-optic processors for control and signal processing in phased-array antennas are presented. These processors can operate in both the antenna transmit and receive modes. An experimental acousto-optic processor is demonstrated in the laboratory. This optical technique replaces all the phase-shifting devices required in electronically controlled phased-array antennas. PMID:20706392

Riza, N A; Psaltis, D

1991-08-10

95

Design of a parallel RISC image processor based on PCI bus  

NASA Astrophysics Data System (ADS)

Low-level image processing usually involves simple, repetitive operations over the entire input image. The image processor may therefore communicate frequently with the memory system or with other processors, and it must provide a high throughput rate. In this article we present an architectural design and analysis of a parallel RISC image processor. The processor is based on the PCI bus to speed up a range of image processing operations. Another characteristic of the processor is that a new three-port host bridge is integrated into it. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed.

Jiang, Xianyang; Shen, Xubang; Zhang, Tianxu

2001-09-01

96

Scalable Parallel Suffix Array Construction  

Microsoft Academic Search

Suffix arrays are a simple and powerful data structure for text processing that can be used for full text indexes, data compression, and many other applications, in particular in bioinformatics. We describe the first implementation and experimental evaluation of a scalable parallel algorithm for suffix array construction. The implementation works on distributed memory computers using MPI. Experiments with up

Fabian Kulla; Peter Sanders

2006-01-01
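
For readers unfamiliar with the data structure, a suffix array can be sketched in a few lines. This is a naive sequential construction for illustration only; it is not the scalable MPI algorithm the record describes, and the function name `suffix_array` is hypothetical.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction: sorts the suffixes directly, which is fine for short
    strings but far too slow for the large inputs a parallel algorithm targets.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]: "a", "ana", "anana", "banana", "na", "nana"
```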

97

The performance realities of massively parallel processors: A case study  

SciTech Connect

This paper presents the results of an architectural comparison of SIMD massive parallelism, as implemented in the Thinking Machines Corp. CM-2 computer, and vector or concurrent-vector processing, as implemented in the Cray Research Inc. Y-MP/8. The comparison is based primarily upon three application codes that represent Los Alamos production computing. Tests were run by porting optimized CM Fortran codes to the Y-MP, so that the same level of optimization was obtained on both machines. The results for fully-configured systems, using measured data rather than scaled data from smaller configurations, show that the Y-MP/8 is faster than the 64k CM-2 for all three codes. A simple model that accounts for the relative characteristic computational speeds of the two machines, and reduction in overall CM-2 performance due to communication or SIMD conditional execution, is included. The model predicts the performance of two codes well, but fails for the third code, because the proportion of communications in this code is very high. Other factors, such as memory bandwidth and compiler effects, are also discussed. Finally, the paper attempts to show the equivalence of the CM-2 and Y-MP programming models, and also comments on selected future massively parallel processor designs.

Lubeck, O.M.; Simmons, M.L.; Wasserman, H.J.

1992-07-01

98

The performance realities of massively parallel processors: A case study  

SciTech Connect

This paper presents the results of an architectural comparison of SIMD massive parallelism, as implemented in the Thinking Machines Corp. CM-2 computer, and vector or concurrent-vector processing, as implemented in the Cray Research Inc. Y-MP/8. The comparison is based primarily upon three application codes that represent Los Alamos production computing. Tests were run by porting optimized CM Fortran codes to the Y-MP, so that the same level of optimization was obtained on both machines. The results for fully-configured systems, using measured data rather than scaled data from smaller configurations, show that the Y-MP/8 is faster than the 64k CM-2 for all three codes. A simple model that accounts for the relative characteristic computational speeds of the two machines, and reduction in overall CM-2 performance due to communication or SIMD conditional execution, is included. The model predicts the performance of two codes well, but fails for the third code, because the proportion of communications in this code is very high. Other factors, such as memory bandwidth and compiler effects, are also discussed. Finally, the paper attempts to show the equivalence of the CM-2 and Y-MP programming models, and also comments on selected future massively parallel processor designs.

Lubeck, O.M.; Simmons, M.L.; Wasserman, H.J.

1992-01-01

99

On Fault-Tolerant Structure, Distributed Fault-Diagnosis, Reconfiguration, and Recovery of the Array Processors  

Microsoft Academic Search

A study is made of the design of fault-tolerant array processors. It is shown how hardware redundancy can be used in the existing structures in order to make them capable of withstanding the failure of some of the array links and processors. Distributed fault-tolerance schemes are introduced for the diagnosis of the faulty elements, reconfiguration, and recovery of the array.

Seyed H. Hosseini

1989-01-01

100

Efficient implementation of a high-level language on a bit-serial parallel matrix processor  

SciTech Connect

Many modern supercomputers perform operations in parallel (more than one at once) to increase processing power. One important type of parallel processor is the bit-serial parallel matrix processor, an example of which is NASA's Massively Parallel Processor. Programming a parallel matrix processor requires languages which are capable of directly expressing those operations which the machine can support. A good language should be powerful, easy to use, and efficiently implementable. This thesis describes the development of the language Parallel Pascal for the class of parallel matrix processors, and describes the implementation of Parallel Pascal by a two-phase compiler whose phases communicate through an intermediate language called Parallel P-code. The efficient implementation of Parallel Pascal by this means is examined. Parallel Pascal is shown to be well-suited to the MPP, and the use of Parallel P-code as an intermediate language is shown to be a workable implementation. The implementation of a simple input-output system is described. A heuristic method for generating specialized bit-serial functions is presented.

Bruner, J.D.

1982-01-01

101

Addressable microlens array for parallel laser microfabrication.  

PubMed

Parallel processing in femtosecond-laser-based microfabrication is demonstrated using a microlens array in conjunction with a liquid-crystal spatial light modulator (SLM). A portion of the SLM is mapped onto each individual lenslet in the array and can be used to effectively switch foci on and off for fabrication. In addition, the technique allows for homogenizing the intensity of the array of foci and translating spots relative to their natural focus. The technique demonstrates the potential for high efficiency processing of aperiodic structures. PMID:21686000

Salter, Patrick S; Booth, Martin J

2011-06-15

102

New parallel processor system with optical interconnection specific for Monte Carlo analysis  

Microsoft Academic Search

We have designed a parallel processor system with hundreds of processors specifically for Monte Carlo analysis. This system has a ring-bus architecture. According to computer simulation, a performance of several Gflops is expected in this system. However, it was revealed that the data transfer speed of the bus has to be increased more dramatically in order to further increase

Mitsumasa Koyanagi; Tamio Shimatani; Takuji Matsumoto; Kee-Ho Yu; Y. Yoshida; Reiji Aibara

1995-01-01

103

Parallelizing Methods Analysis for Solving Large Sparse Power Network Equations on Multi-core Processor Platforms  

Microsoft Academic Search

The solution of large sparse network equations is a recurrent problem in almost every algorithm of power system simulation, which is widely used in offline analysis, online stability assessment and control. Nowadays, processor architectures are moving toward the integration of more cores on one chip, whilst traditional serial and parallel software cannot fully exploit the performance of multi-core processors.

Zhang Jia-an; Zhang Na; Jiang Yi-lang

2011-01-01

104

Time and Parallel Processor Bounds for Fortran-Like Loops  

Microsoft Academic Search

The main goal of this paper is to show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. The dependence structure of the whole loop may be of different types.

Utpal Banerjee; Shyh-ching Chen; David J. Kuck; Ross A. Towle

1979-01-01
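
A dependence check between two array references in a loop can be sketched with the classic GCD test. This is offered as a well-known illustration of the kind of check the abstract mentions; the abstract does not say which method the paper uses, and the function name and the subscript form `a*i + b` are assumptions made for the example.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Conservative dependence test for references X[a*i + b] and X[c*j + d].

    A dependence requires integers i, j with a*i + b == c*j + d, which is
    only possible if gcd(a, c) divides (d - b). Loop bounds are ignored, so
    True means "may depend" and False means "cannot depend".
    """
    g = gcd(a, c)
    if g == 0:  # both subscripts are constants
        return b == d
    return (d - b) % g == 0


# X[2*i] written, X[2*i + 1] read: even and odd elements never overlap.
print(gcd_test(2, 0, 2, 1))
# X[2*i] written, X[4*i + 2] read: both touch even elements, so they may overlap.
print(gcd_test(2, 0, 4, 2))
```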

105

Optical absorption for parallel cylinder arrays  

Microsoft Academic Search

We study the long-wavelength electromagnetic resonances of interacting cylinder arrays. By using a normal-modes expansion where the effects of geometry and material are separated, it is shown that two parallel cylinders with different radii have electromagnetic modes distributed symmetrically about depolarization factor 1/2. Both sets couple to longitudinal and transverse components of the external field, but amplitudes of symmetric depolarization

P. Robles; R. Rojas; F. Claro

2002-01-01

106

Parallel algorithms and architectures for CPUs and dedicated processors: development and trends  

Microsoft Academic Search

Parallel algorithms are usually intended as those related to problems to be run on supercomputers characterized by a large number of processors interacting via a communication network. Parallel algorithms and architectures are also relevant at lower levels. Of particular interest is the CPU level where elementary arithmetic (and higher order) operations are executed. It might be surprising to notice

Luigi Dadda

1995-01-01

107

Efficient Parallel Execution of Streaming Applications on Multi-core Processors  

Microsoft Academic Search

We propose a method for the parallel execution of applications that process continuous streams of data. Unlike pipeline-based approaches, which are frequently employed to parallelize software for multi-core processors, our method supports nonlinear structures that may contain conditionals. Nonlinear structures reduce the latency for processing an element from a stream, which is particularly important for embedded systems that are subject

Tobias Schuele; Siemens AG

2011-01-01

108

Embedding Binary X-Trees and Pyramids in Processor Arrays with Spanning Buses  

Microsoft Academic Search

We study the problem of network embeddings in 2-D array architectures in which each row and column of processors are interconnected by a bus. These architectures are especially attractive if optical buses are used that allow simultaneous access by multiple processors through either wavelength division multiplexing or message pipelining, thus overcoming the bottlenecks caused by the exclusive access of

Zicheng Guo; Rami G. Melhem

1994-01-01

109

Optically switching parallel processors by means of Langmuir-Blodgett multilayer films.  

PubMed

The photoexcitation transport and the reversible photochemical reaction in Langmuir-Blodgett film are studied and show its applicability to an optical parallel processor. The excitation transport is switched depending on UV or visible irradiation, which controls the photochromic reaction. The performance of this photochromic Langmuir-Blodgett multilayer is 10^12 NOT operations s^-1 and is comparable with other types of optical processors, such as a liquid-crystal light valve. PMID:20962960

Yamazaki, I; Okazaki, S; Minami, T; Ohta, N

1994-11-10

110

Digital convolution filtering techniques on an array processor for particle image velocimetry  

Microsoft Academic Search

The use of digital, convolution filtering techniques in particle image velocimetry is described. The technique is illustrated by considering its application to real and synthetic particle image velocimetry images using a dedicated array processor.

Ian Grant; Jian Hang Qiu

1990-01-01

111

On fault-tolerant structure, distributed fault-diagnosis, reconfiguration, and recovery of the array processors  

SciTech Connect

The increasing need for the design of high-performance computers has led to the design of special purpose computers such as array processors. This paper studies the design of fault-tolerant array processors. First, it is shown how hardware redundancy can be employed in the existing structures in order to make them capable of withstanding the failure of some of the array links and processors. Then distributed fault-tolerance schemes are introduced for the diagnosis of the faulty elements, reconfiguration, and recovery of the array. Fault tolerance is maintained by the cooperation of processors in a decentralized form of control without the participation of any type of hardcore or fault-free central controller such as a host computer.

Hosseini, S.H.

1989-07-01

112

Implementation of Parallel Algorithms.  

National Technical Information Service (NTIS)

Contents: Intermediate Representation for Parallel Implementation; Data Movement on Processor Arrays; Data-Parallel Implementations of Fast Multipole Algorithms for N-Body Interaction; Rate Control in Parallel Algorithms; Implementing Asynchronous Paralle...

J. H. Reif; R. Wagner

1993-01-01

113

Architecture and modeling of a parallel digital processor based image processing system  

NASA Astrophysics Data System (ADS)

The paper describes an image processing system which uses both shared memory and message passing. Shared memory is used in conjunction with a high speed parallel bus to transfer image data; message passing is used for general inter-processor communication. A prototype system based upon the Texas Instruments TMS320C40 digital signal processor is currently in the final stages of construction. A Petri Net model of the communication aspects of the TMS320C40 processor has been developed. Features of the Petri Net software are discussed and the raw communication performance of the TMS320C40 shown. The modeling of a four and sixteen processor system applied to 2D FFT transforms is described.

Hartley, David A.; Kshirsagar, Shirish P.

1994-09-01

114

Development of parallel implicit Navier-Stokes solvers on MIMD multi-processor systems  

SciTech Connect

The development of implicit numerical methods for the solution of the compressible Navier-Stokes equations on massively parallel systems is presented. The equations are solved for a generalized curvilinear coordinate system. Different numerical methods for the discretization of the fluxes, such as Flux Vector Splitting and a Riemann solver up to third order of accuracy, are parallelized. An implicit unfactored method is used for the solution of the system of equations using Gauss-Seidel subiteration states. The efficiency of the parallel Navier-Stokes solvers is investigated for compressible flow fields on different parallel machines while comparisons with single processor calculations are presented. 33 refs.

Drikakis, D.; Schreck, E. (Erlangen-Nuernberg Univ., Erlangen (Germany))

1993-01-01

115

A GaAs vector processor based on parallel RISC microprocessors  

NASA Astrophysics Data System (ADS)

A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.

Misko, Tim A.; Rasset, Terry L.

116

Parallelization of High Dynamic Range Image Creation on Multi-core Processor Architectures  

Microsoft Academic Search

The emergence of multi-core processor architectures and of diverse parallel computing paradigms has permeated into the area of mainstream computing. In this paper we present various parallelization approaches to High Dynamic Range image creation, a rising technology employed in the field of imaging manipulation and processing. OpenMP and Pthreads implementation details are provided, and the performance and load-balancing capabilities of

Vlad-Marian Spoiala; Emil Slusanschi; Monica Dagadita; Cristian Bancu

2011-01-01

117

A 40GOPS 250mW massively parallel processor based on matrix architecture  

Microsoft Academic Search

The matrix processing engine (MTX) is a massively parallel processor based on the matrix architecture. 40 GOPS (16b additions) is achieved at 200 MHz clock frequency and 250 mW power dissipation. 2048 ALUs and 1 Mb SRAM connected by a flexible switching network are integrated in 3.1 mm² using a 90 nm CMOS process.

M. Nakajima; H. Noda; K. Dosaka; K. Nakata; M. Higashida; O. Yamamoto; K. Mizumoto; H. Kondo; Y. Shimazu; K. Arimoto; K. Saitoh; T. Shimizu

2006-01-01

118

The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors  

Microsoft Academic Search

Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this paper we show how application codes written in a subset of

Matthew T. O'keefe; Terence Parr; Kevin Edgar; Steve Anderson; Paul Woodward; Hank Dietz

1995-01-01

119

An electro-optic data communication system for the Delft parallel processor  

Microsoft Academic Search

The Delft Parallel Processor (DPP), which has already been operational since 1981, is part of a long term research project for the application of a large-scale Multiple Instruction stream, Multiple Data stream (MIMD) architecture in the field of scientific computing. As a result of this project a second modern version, the DPP84, equipped with up to 16 Processing Elements (PE's),

E. E. E. Frietman; A. B. Ruighaver

1987-01-01

120

A site oriented supercomputer for theoretical physics: The Fermilab Advanced Computer Program Multi Array Processor System (ACPMAPS)  

SciTech Connect

The ACPMAPS multiprocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. The system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.

Nash, T.; Atac, R.; Cook, A.; Deppe, J.; Fischler, M.; Gaines, I.; Husby, D.; Pham, T.; Zmuda, T.; Eichten, E.

1989-03-06
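
The figures quoted in this abstract can be combined in a short back-of-the-envelope calculation. The node count and total cost below are not stated in the record; they are implied estimates that assume the $300/Mflop figure applies to peak performance, which the abstract does not confirm.

```python
# Figures taken from the abstract.
peak_node_mflops = 20        # peak power per single-board node
peak_system_mflops = 5000    # 5 Gflops system peak
cost_per_mflop_usd = 300     # approximate cost figure

# Derived values (assumptions, not reported in the record).
implied_nodes = peak_system_mflops / peak_node_mflops
implied_cost_usd = cost_per_mflop_usd * peak_system_mflops

print(implied_nodes)      # about 250 nodes
print(implied_cost_usd)   # about 1.5 million dollars if the rate applies to peak
```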

121

Design of a parallel processor for image processing on-board satellites: an application oriented approach  

SciTech Connect

A parallel MIMD type processor for use in image processing applications on board satellites is described. Emphasis is given to the application requirements in terms of processing power, type of parallelism, communication need and to the impact of these requirements on the architecture design. The choice of a MIMD processor with a ring bus, the convenience of a multiple bus structure, the definition of the bus protocol, the synchronization mechanism and the typical performances are presented as successive choices and discussed with regard to the requirements. Possibilities and limits of the architecture are carefully analysed: typical examples of efficiently implementable applications in other fields of image processing are given. But limits of the structure are pointed out for other types of parallel processing. 22 references.

Gailat, C.

1983-01-01

122

Digital signal array processor for NSLS booster power supply upgrade  

SciTech Connect

The booster at the NSLS is being upgraded from 0.75 to 2 pulses per second. To accomplish this, new power supplies for the dipole, quadrupole, and sextupole have been installed. This paper will outline the design and function of the digital signal processor used as the primary control element in the power supply control system.

Olsen, R.; Dabrowski, J. [Brookhaven National Lab., Upton, NY (United States); Murray, J. [State Univ. of New York, Stony Brook, NY (United States)

1993-07-01

123

Processing modes and parallel processors in producing familiar keying sequences.  

PubMed

Recent theorizing indicates that the acquisition of movement sequence skill involves the development of several independent sequence representations at the same time. To examine this for the discrete sequence production task, participants in Experiment 1 produced a highly practiced sequence of six key presses in two conditions that allowed little preparation so that interkey intervals were slowed. Analyses of the distributions of moderately slowed interkey intervals indicated that this slowing was caused by the occasional use of two slower processing modes, that probably rely on independent sequence representations, and by reduced parallel processing in the fastest processing mode. Experiment 2 addressed the role of intention for the fast production of familiar keying sequences. It showed that the participants, who were not aware they were executing familiar sequences in a somewhat different task, had no benefits of prior practice. This suggests that the mechanisms underlying sequencing skills are not automatically activated by mere execution of familiar sequences, and that some form of top-down, intentional control remains necessary. PMID:12739146

Verwey, Willem B

2002-11-30

124

Design and Evaluation of a Novel Real-Shared Cache Module for High Performance Parallel Processor Chip  

Microsoft Academic Search

Nowadays, it is very important that integrating parallel processors on a chip offers high performance and low interactive response time on applications with fine-grained parallelism and a high degree of data sharing. We propose a novel real-shared cache module with a new multiport ring-bus architecture to overcome the bus bottleneck problem of existing parallel processor chips at the shared-cache level. A

Zhe Liu; Jeoungchill Shim; Hiroyuki Kurino; Mitsumasa Koyanagi

2004-01-01

125

Frame-level pipelined motion estimation array processor  

Microsoft Academic Search

A systolic motion estimation processor (MEP) core architecture implementing the full-search block-matching (FSBM) algorithm is presented. A unique feature of this MEP architecture is its support of frame-level pipelined operation. As such, it is possible to process pixels from consecutive frames without any processor idle time. It is designed so that no data broadcasting operations are required,

Surin Kittitornkun; Yu Hen Hu

2001-01-01
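
The full-search block-matching algorithm named in this abstract can be sketched in software. This is a plain sequential illustration, not the systolic architecture of the paper; it assumes the sum of absolute differences (SAD) as the matching cost, and the function names and the tiny synthetic frames are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, p):
    """Try every displacement in [-p, p] x [-p, p] that stays inside ref.

    Returns (best_cost, dx, dy) for the displacement with the lowest SAD.
    """
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = 0 <= by + dy and by + dy + n <= h and 0 <= bx + dx and bx + dx + n <= w
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic 8 x 8 frames: cur is ref shifted by 2 pixels in x and 1 in y.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0 for x in range(8)]
       for y in range(8)]

print(full_search(cur, ref, bx=2, by=2, n=2, p=2))  # (0, 2, 1)
```

The two nested displacement loops are what a hardware array evaluates in parallel; the sketch only makes the search space explicit.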

126

Parallelization Strategies and Performance Analysis of Media Mining Applications on Multi-Core Processors  

Microsoft Academic Search

This paper studies how to parallelize the emerging media mining workloads on existing small-scale multi-core processors and future large-scale platforms. Media mining is an emerging technology to extract meaningful knowledge from large amounts of multimedia data, aiming at helping end users search, browse, and manage multimedia data. Many of the media mining applications are very complicated and require a huge

Wenlong Li; Xiaofeng Tong; Tao Wang; Yimin Zhang; Yen-kuang Chen

2009-01-01

127

A fully parallel vector-quantization processor for real-time motion-picture compression  

Microsoft Academic Search

A vector-quantization (VQ) processor system has been developed aiming at real-time compression of motion pictures using a 0.6-µm triple-metal CMOS technology. The chip employs a fully parallel single-instruction, multiple-data architecture having a two-stage pipeline. Each pipeline segment consists of 19 cycles, thus enabling the execution of a single VQ operation in only 19 clock cycles. As a result, it has

Akira Nakada; Tadashi Shibata; Masahiro Konda; Tatsuo Morimoto; Tadahiro Ohmi

1999-01-01
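
A single VQ operation amounts to finding the codebook entry closest to an input vector. The sketch below shows that search sequentially; it assumes squared Euclidean distance, which the abstract does not specify, and the function name and example codebook are hypothetical. A fully parallel chip would evaluate all codebook entries at once instead of looping.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to vector.

    Distance measure is squared Euclidean (an assumption for this sketch).
    """
    def distance(code):
        return sum((a - b) ** 2 for a, b in zip(vector, code))

    return min(range(len(codebook)), key=lambda k: distance(codebook[k]))


codebook = [[0, 0], [10, 10], [0, 10]]
print(quantize([1, 9], codebook))  # 2, since [0, 10] is the closest entry
```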

128

MEMS Microphone Array and Signal Processor for Realtime Object Detection  

NASA Astrophysics Data System (ADS)

We have developed an ultrasonic sound processing system for 3D imaging with 128 microelectromechanical systems (MEMS) microphones and a highly configurable field programmable gate array (FPGA). The system consists of a sensor array board, analog-to-digital converter (ADC) modules, and a processing board. The ultrasonic MEMS sensors are precisely aligned on a printed circuit board (PCB) to form a 16 × 8 planar grid with 112° × 50° viewing angles for wide-band signals with the center frequency at 40 kHz.

Maeda, Yasushige; Sugimoto, Masanori; Hashizume, Hiromichi

129

Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition  

DOEpatents

A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

Tomkins, James L. (Albuquerque, NM); Camp, William J. (Albuquerque, NM)

2007-07-17

130

Optimal piecewise linear schedules for LSGP- and LPGS-decomposed array processors via quadratic programming  

NASA Astrophysics Data System (ADS)

The size of a systolic array synthesized from a uniform recurrence equation, whose computations are mapped by a linear function to the processors, matches the problem size. In practice, however, there exist several limiting factors on the array size. There are two dual schemes available to derive arrays of smaller size from large-size systolic arrays based on the partitioning of the large-size arrays into subarrays. In LSGP, the subarrays are clustered one-to-one into the processors of a small-size array, while in LPGS, the subarrays are serially assigned to a reduced-size array. In this paper, we propose a common methodology for both LSGP and LPGS based on polyhedral partitionings of large-size k-dimensional systolic arrays which are synthesized from n-dimensional uniform recurrences by linear mappings for allocation and timing. In particular, we address the optimization problem of finding optimal piecewise linear timing functions for small-size arrays. These are mappings composed of linear timing functions for the computations of the subarrays. We study a continuous approximation of this problem by passing from piecewise linear to piecewise quasi-linear timing functions. The resultant problem formulation is then a quadratic programming problem which can be solved by standard algorithms for nonlinear optimization problems.

Zimmermann, K.-H.; Achtziger, W.

2001-09-01

131

From Bit Level Systolic Arrays to HDTV Processor Chips  

Microsoft Academic Search

The paper presents the work initially carried out by Queen's University and RSRE (now Qinetiq) in the development of advanced architectures and microchips based on systolic array architectures. The paper outlines how this has led to the development of highly complex designs for high definition TV and highlights work both on advanced signal processing architectures and tool flows for

John V. Mccanny; Roger F. Woods; John G. Mcwhirter

2006-01-01

132

DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors  

PubMed Central

Background: Parallel computing is frequently used to speed up computationally expensive tasks in bioinformatics. Results: Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic that splits sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions: By distributing sub-routines to multiple processors, the running time of DIALIGN can be substantially reduced. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.

Schmollinger, Martin; Nieselt, Kay; Kaufmann, Michael; Morgenstern, Burkhard

2004-01-01
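
Point (a) of the abstract above, that pair-wise alignments are independent and can be farmed out to workers without changing the result, can be sketched as follows. This is not DIALIGN code: `toy_pair_score` is a hypothetical stand-in for a real alignment routine, and a thread pool is used only to keep the sketch portable (a CPU-bound production run would more likely use processes or MPI).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_pair_score(pair):
    """Hypothetical stand-in for one pair-wise alignment:
    counts positions at which the two sequences agree."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


def all_pairwise(seqs, workers=4):
    """Score every unordered sequence pair, distributing pairs over workers.
    Because pairs are independent, the result does not depend on `workers`."""
    pairs = list(combinations(range(len(seqs)), 2))
    jobs = [(seqs[i], seqs[j]) for i, j in pairs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(toy_pair_score, jobs))  # map preserves order
    return dict(zip(pairs, scores))


if __name__ == "__main__":
    print(all_pairwise(["ACGT", "ACGA", "TCGT"]))
```

The number of pairs grows quadratically with the number of sequences, which is consistent with the abstract's remark that this step dominates the CPU time.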

133

DC simulator of large-scale nonlinear systems for parallel processors  

NASA Astrophysics Data System (ADS)

In this paper it is shown how the idea of the BBD decomposition of large-scale nonlinear systems can be implemented in a parallel DC circuit simulation algorithm. Usually, the BBD decomposition of nonlinear circuits has been used together with the multi-level Newton-Raphson iterative process. We propose a simulation consisting of circuit decomposition and process parallelization on a single level only. This block-parallel approach may considerably reduce simulation time, though the gain depends strongly on the system topology and, of course, on the processor type. The paper presents the architecture of the decomposition-based algorithm, explains details of its implementation, including two steps of the one-level bypassing techniques, and discusses the construction of dedicated benchmarks for this simulation software.

Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel

2012-05-01

134

Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies  

NASA Astrophysics Data System (ADS)

Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of the RX in multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability in different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristics (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible Infrared Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.

Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio

2011-10-01
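
The global variant of the RX detector mentioned in the abstract above scores each pixel by its Mahalanobis distance from the scene mean, using the sample covariance matrix of the whole scene. The sketch below is a deliberately small, serial, pure-Python illustration restricted to two spectral bands so that the 2 × 2 covariance can be inverted in closed form; the function name and the two-band restriction are assumptions made here, not details of the paper's multi-core implementation.

```python
def rx_scores_2band(pixels):
    """Global RX scores for a list of (band0, band1) pixels.
    Score = (x - mean)^T * inverse(sample covariance) * (x - mean)."""
    n = len(pixels)
    m0 = sum(p[0] for p in pixels) / n
    m1 = sum(p[1] for p in pixels) / n
    # Sample covariance matrix entries (unbiased, divide by n - 1).
    c00 = sum((p[0] - m0) ** 2 for p in pixels) / (n - 1)
    c11 = sum((p[1] - m1) ** 2 for p in pixels) / (n - 1)
    c01 = sum((p[0] - m0) * (p[1] - m1) for p in pixels) / (n - 1)
    det = c00 * c11 - c01 * c01
    if det == 0:
        raise ValueError("singular covariance matrix")
    # Closed-form inverse of the symmetric 2 x 2 covariance matrix.
    i00, i01, i11 = c11 / det, -c01 / det, c00 / det
    scores = []
    for p in pixels:
        d0, d1 = p[0] - m0, p[1] - m1
        scores.append(d0 * (i00 * d0 + i01 * d1) + d1 * (i01 * d0 + i11 * d1))
    return scores


if __name__ == "__main__":
    scene = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0), (5.0, 5.0)]
    scores = rx_scores_2band(scene)
    print(scores.index(max(scores)))  # index of the most anomalous pixel
```

A local strategy would compute the mean and covariance per spatial block instead of once per scene, which is the trade-off the abstract says affects both scalability and detection results.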

135

A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors.  

PubMed

A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array, with row-parallel 6-bit Algorithmic ADCs, row-parallel gray-scale image processors, pixel-parallel SIMD Processing Element (PE) array, and instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixels resolution and 6-bit gray-scale image is fabricated in 0.18 µm Standard CMOS process. The area size of chip is 1.5 mm × 3.5 mm. Each pixel size is 9.5 µm × 9.5 µm and each processing element size is 23 µm × 29 µm. The experiment results demonstrate that the chip can perform low-level and mid-level image processing and it can be applied in the real-time vision applications, such as high speed target tracking. PMID:22454565

Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

2009-07-27

136

A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors  

PubMed Central

A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array, with row-parallel 6-bit Algorithmic ADCs, row-parallel gray-scale image processors, pixel-parallel SIMD Processing Element (PE) array, and instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixels resolution and 6-bit gray-scale image is fabricated in 0.18 µm Standard CMOS process. The area size of chip is 1.5 mm × 3.5 mm. Each pixel size is 9.5 µm × 9.5 µm and each processing element size is 23 µm × 29 µm. The experiment results demonstrate that the chip can perform low-level and mid-level image processing and it can be applied in the real-time vision applications, such as high speed target tracking.

Lin, Qingyu; Miao, Wei; Zhang, Wancheng; Fu, Qiuyu; Wu, Nanjian

2009-01-01

137

Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors  

SciTech Connect

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering more than a 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

Aaby, Brandon G [ORNL; Perumalla, Kalyan S [ORNL; Seal, Sudip K [ORNL

2010-01-01

138

Tile Percolation: An OpenMP Tile Aware Parallelization Technique for the Cyclops-64 Multicore Processor  

NASA Astrophysics Data System (ADS)

Programming a multicore processor is difficult. It is even more difficult if the processor has a software-managed memory hierarchy, e.g. the IBM Cyclops-64 (C64). A widely accepted parallel programming solution for multicore processors is OpenMP. Currently, all OpenMP directives are only used to decompose computation code (such as loop iterations, tasks, code sections, etc.). None of them can be used to control data movement, which is crucial for the C64 performance. In this paper, we propose a technique called tile percolation. This method provides the programmer with a set of OpenMP pragma directives. The programmer can use these directives to annotate their program to specify where and how to perform data movement. The compiler will then generate the required code accordingly. Our method is a semi-automatic code generation approach intended to simplify a programmer’s work. The paper provides (a) an exploration of the possibility of developing pragma directives for semi-automatic data movement code generation in OpenMP; (b) an introduction of techniques used to implement tile percolation including the programming API, the code generation in compiler, and the required runtime support routines; and (c) an evaluation of tile percolation with a set of benchmarks. Our experimental results show that tile percolation can make the OpenMP programs run on the C64 chip more efficiently.

Gan, Ge; Wang, Xu; Manzano, Joseph; Gao, Guang R.

139

Simulation study of a parallel processor with unbalanced loads. Master's thesis  

SciTech Connect

The purpose of this thesis was twofold: to estimate the impact of unbalanced computational loads on a parallel-processing architecture via Monte Carlo simulation; and second to investigate the impact of representing the dynamics of the parallel-processing problem via animated simulation. It is constrained to the hypercube architecture in which each node is connected in a predetermined topology and allowed to communicate to other nodes through calls to the operating system. Routing of messages through the network is fixed and specified within the operating system. Message-transmission preempts nodal processing causing internodal communications to complicate the concurrent operation of the network. Two independent variables are defined: 1) the degree of imbalance characterizes the nature or severity of the load imbalance, and 2) the degree of locality characterizes the node loadings with respect to node locations across the cube. A SLAM II simulation model of a generic 16 node hypercube was constructed in which each node processes a predetermined number of computational tasks and, following each task, sends a message to a single randomly chosen receiver node. An experiment was designed in which the independent variables, degree of imbalance and degree of locality were varied across two computation-to-IO ratios to determine their separate and interactive effects on the dependent variable, job speedup. ANOVA and regression techniques were used to estimate the relationship between load imbalance, locality, computation-to-IO ratio, and their interactions to job speedup. Results show that load imbalance severely impacts a parallel-processor's performance.

Moore, T.S.

1987-12-01
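
The dependence of job speedup on load imbalance studied in the thesis above can be illustrated with a back-of-the-envelope bound. The sketch below is a simplification made here, not the SLAM II model: it ignores message traffic and preemption entirely and assumes the job finishes when the busiest node finishes; `ideal_speedup` is a hypothetical helper name.

```python
def ideal_speedup(loads):
    """Upper bound on speedup over a single node when communication is ignored:
    total work divided by the work on the most heavily loaded node."""
    return sum(loads) / max(loads)


if __name__ == "__main__":
    balanced = [10] * 16            # 16 nodes, equal load
    skewed = [40] + [8] * 15        # same total work, one overloaded node
    print(ideal_speedup(balanced))  # 16.0
    print(ideal_speedup(skewed))    # 4.0
```

Even in this communication-free bound, moving a quarter of the work onto one of 16 nodes cuts the attainable speedup from 16 to 4; the thesis model adds inter-node messaging on top of this effect.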

140

High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects  

DOEpatents

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

2002-01-01

141

Parallel arrays of microtubules formed in electric and magnetic fields  

Microsoft Academic Search

The influence of electric and magnetic fields on microtubule assembly in vitro was studied. Both types of field caused alignment of microtubules in parallel arrays, as demonstrated by electron micrographs. These findings suggest a possible role of microtubules in the biological effects of exogenous as well as endogenous

Peter M. Vassilev; Reni T. Dronzine; Maria P. Vassileva; Georgi A. Georgiev

1982-01-01

142

Distributed Generation of Suffix Arrays  

Microsoft Academic Search

An algorithm for the distributed computation of suffix arrays for large texts is presented. The parallelism model is that of a set of sequential tasks which execute in parallel and exchange messages among them. The underlying architecture is that of a high bandwidth network of processors. Our algorithm builds the suffix array by quickly assigning an independent subproblem to each processor and completing

Gonzalo Navarro; Joao Paulo Kitajima; Berthier A. Ribeiro-neto; Nivio Ziviani

1997-01-01
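
The truncated abstract above does not say how the subproblems are defined or combined, so the following is only a generic illustration of the idea of "one independent subproblem per processor" applied to suffix arrays. It is an assumption of this sketch, not the authors' algorithm, that each worker sorts the suffixes starting in its own slice of the text and that the sorted runs are then merged; the sketch runs serially and compares whole suffixes, which a scalable implementation would avoid.

```python
import heapq


def suffix_array_partitioned(text, parts):
    """Build the suffix array of `text` from `parts` independently sorted runs.
    Each run holds the suffix start positions that fall in one slice of the text."""
    n = len(text)
    chunk = max(1, -(-n // parts))  # ceiling division: positions per slice

    def suffix(i):
        return text[i:]

    runs = []
    for start in range(0, n, chunk):
        positions = range(start, min(start + chunk, n))
        # Independent subproblem: could run on its own processor.
        runs.append(sorted(positions, key=suffix))
    # Combine step: k-way merge of the sorted runs.
    return list(heapq.merge(*runs, key=suffix))


if __name__ == "__main__":
    print(suffix_array_partitioned("banana", 3))  # [5, 3, 1, 0, 4, 2]
```

The local sorts need no communication, but the merge still compares suffixes that extend beyond a slice, which is where message exchange would enter in a distributed setting.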

143

Bayesian image reconstruction for emission tomography incorporating Good's roughness prior on massively parallel processors.  

PubMed

Since the introduction by Shepp and Vardi [Shepp, L. A. & Vardi, Y. (1982) IEEE Trans. Med. Imaging 1, 113-121] of the expectation-maximization algorithm for the generation of maximum-likelihood images in emission tomography, a number of investigators have applied the maximum-likelihood method to imaging problems. Though this approach is promising, it is now well known that the unconstrained maximum-likelihood approach has two major drawbacks: (i) the algorithm is computationally demanding, resulting in reconstruction times that are not acceptable for routine clinical application, and (ii) the unconstrained maximum-likelihood estimator has a fundamental noise artifact that worsens as the iterative algorithm climbs the likelihood hill. In this paper the computation issue is addressed by proposing an implementation on the class of massively parallel single-instruction, multiple-data architectures. By restructuring the superposition integrals required for the expectation-maximization algorithm as the solutions of partial differential equations, the local data passage required for efficient computation on this class of machines is satisfied. For dealing with the "noise artifact" a Markov random field prior determined by Good's rotationally invariant roughness penalty is incorporated. These methods are demonstrated on the single-instruction multiple-data class of parallel processors, with the computation times compared with those on conventional and hypercube architectures. PMID:2014243

Miller, M I; Roysam, B

1991-04-15

144

Lumped-Element Planar Strip Array (LPSA) for Parallel MRI  

PubMed Central

The recently introduced planar strip array (PSA) can significantly reduce scan times in parallel MRI by enabling the utilization of a large number of RF strip detectors that are inherently decoupled, and are tuned by adjusting the strip length to integer multiples of a quarter-wavelength (λ/4) in the presence of a ground plane and dielectric substrate. In addition, the more explicit spatial information embedded in the phase of the signals from the strip array is advantageous (compared to loop arrays) for limiting aliasing artifacts in parallel MRI. However, losses in the detector as its natural resonance frequency approaches the Larmor frequency (where the wavelength is long at 1.5 T) may limit the signal-to-noise ratio (SNR) of the PSA. Moreover, the PSA’s inherent λ/4 structure severely limits our ability to adjust detector geometry to optimize the performance for a specific organ system, as is done with loop coils. In this study we replaced the dielectric substrate with discrete capacitors, which resulted in both SNR improvement and a tunable lumped-element PSA (LPSA) whose dimensions can be optimized within broad constraints, for a given region of interest (ROI) and MRI frequency. A detailed theoretical analysis of the LPSA is presented, including its equivalent circuit, electromagnetic fields, SNR, and g-factor maps for parallel MRI. Two different decoupling schemes for the LPSA are described. A four-element LPSA prototype was built to test the theory with quantitative measurements on images obtained with parallel and conventional acquisition schemes.

Lee, Ray F.; Hardy, Christopher J.; Sodickson, Daniel K.; Bottomley, Paul A.

2007-01-01

145

Parallel optical interconnects with mixed-signal OEIC and fibre arrays for high-speed communication  

NASA Astrophysics Data System (ADS)

We present a system for direct parallel optical data communication between integrated circuits on neighbouring printed circuit boards based on a monolithic integrated CMOS smart pixel array, fibre arrays, and VCSELs. The advantage of our system versus backplane systems is the direct data transfer through the space avoiding planar and area consuming interconnections. The detector chip allows a data rate of 625 Mbit/s per link and is cycled by an optical clock. A simulation of the chip layout showed 260% more performance versus electrical off-chip interconnects. In principle an 8×8 data transfer is feasible allowing a data rate of 40 Gbit/s. The detector combines an optical receiver array with a digital processor array which executes image processing algorithms. The optical receiver is formed by a PIN photodiode with a diameter of 40 µm, a transimpedance amplifier (TIA) and a decision-making postamplifier. The measured responsivity of the photodiode without antireflection coating is R=0.382 A/W at an optical wavelength of 670 nm. The TIA consists of a CMOS inverter and a PMOS transistor forming the feedback resistor. Together with the postamplifier, formed by a chain of five CMOS inverters and attaining digital CMOS levels, a data rate of 625 Mbit/s is achieved.

Fey, Dietmar; Hoppe, Lutz; Loos, Andreas; Fortsch, Michael; Zimmermann, Horst

2004-09-01

146

Parallel Processing of Large Scale Microphone Arrays for Sound Capture  

NASA Astrophysics Data System (ADS)

Performance of microphone sound pick up is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Besides, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. 
The simulation results are shown to be consistent with the real room data. Localization of sound sources has been explored using cross-power spectrum time delay estimation and has been evaluated using real room data under slightly, moderately and highly reverberant conditions. To improve the accuracy and reliability of the source localization, an outlier detector that removes incorrect time delay estimation has been invented. To provide speaker selectivity for microphone array systems, a hands-free speaker identification system has been studied. A recently invented feature using selected spectrum information outperforms traditional recognition methods. Measured results demonstrate the capabilities of speaker selectivity from a matched-filtered array. In addition, simulation utilities, including matched-filtering processing of the array and hands-free speaker identification, have been implemented on the massively parallel nCube super-computer. This parallel computation highlights the requirements for real-time processing of array signals.

Jan, Ea-Ee.

1995-01-01
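
The abstract above uses the traditional delay-and-sum beamformer as its baseline. A minimal sketch of that baseline (not of the matched-filter array the thesis proposes) is given below; it assumes integer sample delays that are already known, and the function name `delay_and_sum` is chosen here for illustration.

```python
def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer with integer sample delays.
    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay (in samples) of the target source at each microphone.
    Each channel is advanced by its delay so the target lines up, then averaged."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, delay in zip(channels, delays):
        for t in range(n):
            src = t + delay  # sample that corresponds to output time t
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]


if __name__ == "__main__":
    # A unit pulse emitted at t = 2 reaches three microphones 0, 1 and 3 samples late.
    mics = [
        [0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0],
    ]
    print(delay_and_sum(mics, [0, 1, 3]))  # pulse restored at index 2
    print(delay_and_sum(mics, [0, 0, 0]))  # misaligned: energy is smeared
```

With the correct delays the three copies add coherently; with wrong delays each copy contributes only a third of the amplitude at a different time, which is the directional selectivity the abstract contrasts with spatial volume selectivity.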

147

Information transmission in parallel threshold arrays: suprathreshold stochastic resonance.  

PubMed

The information transmitted through a parallel summing array of noisy threshold elements with a common threshold is considered. In particular, using theoretical and numerical analysis, a recently reported [N. G. Stocks, Phys. Rev. Lett. 84, 2310 (2000)] form of stochastic resonance, termed suprathreshold stochastic resonance (SSR), is studied in detail. SSR is observed to occur in arrays with two or more elements and, unlike stochastic resonance (SR) in a single element, gives rise to noise-induced information gains that occur independent of the setting of the threshold or the size of the signal. The transmitted information is maximized when all thresholds are set to coincide with the signal mean. In this situation, and for large arrays, the noise can enhance performance up to approximately half the theoretical noiseless channel capacity. The theory is tested against digital simulation. PMID:11308826

Stocks, N G

2001-03-28
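
The system described in the abstract above, a parallel summing array of threshold elements that share one threshold but receive independent noise, can be simulated in a few lines. The sketch below only produces the array output; it does not compute the transmitted information analysed in the paper. Gaussian noise and the function name are assumptions made for illustration.

```python
import random


def threshold_array_output(x, n_elements, noise_sd, threshold=0.0, rng=random):
    """Output of a summing threshold array for input sample x: the number of
    elements whose noisy input x + noise_i exceeds the common threshold.
    The noise is independent Gaussian per element (an assumption of this sketch)."""
    return sum(
        1 for _ in range(n_elements)
        if x + rng.gauss(0.0, noise_sd) > threshold
    )


if __name__ == "__main__":
    rng = random.Random(1)
    # Without noise every element makes the same decision: output is 0 or N.
    print(threshold_array_output(0.2, 15, 0.0, rng=rng))
    # With noise the output takes intermediate values that track x more finely.
    print([threshold_array_output(0.2, 15, 1.0, rng=rng) for _ in range(10)])
```

Without noise the array carries at most one bit per sample regardless of its size; noise lets the N + 1 possible output levels be used, which is the intuition behind the noise-induced information gain reported for arrays of two or more elements.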

148

3D-processor arrays accelerators for high-performance computing in remote sensing applications  

NASA Astrophysics Data System (ADS)

The conceptualization and employment of efficient 3D processor array (3D-PA) accelerator units, combined with the HW/SW co-design technique, is developed in this study on an FPGA platform for the real-time enhancement/reconstruction of large-scale remote sensing (RS) imaging for Geospatial applications. The addressed architecture implements the previously proposed robust fused Bayesian-regularization (RFBR) enhanced radar imaging method for the solution of ill-conditioned inverse spatial spectrum pattern (SSP) estimation problems. Finally, we show how the proposed 3D-PA accelerators drastically reduce the computational load of real-world Geospatial imagery tasks, making them suitable for real-time implementation.

Castillo Atoche, A.; Vazquez Castillo, J.; Rizo Dominguez, L.; Sandoval Gio, J.

2011-10-01

149

A switched interconnection infrastructure to tightly-couple a RISC processor core with a coarse grain reconfigurable array  

Microsoft Academic Search

This paper describes a novel interconnection infrastructure for a general purpose system composed of a RISC processor core and a coarse grain run time reconfigurable array. The proposed infrastructure is based on a nonblocking network of switches and provides a point to point connection between the two processing blocks and all the system peripherals. Modifications to the switches and adoption

Fabio Garzia; Tapani Ahonen; Jari Nurmi

2009-01-01

150

Performance evaluation of the HEP, ELXSI and CRAY X-MP parallel processors on hydrocode test problems  

SciTech Connect

Parallel programming promises improved processing speeds for hydrocodes, magnetohydrocodes, multiphase flow codes, thermal-hydraulics codes, wavecodes and other continuum dynamics codes. This paper presents the results of some investigations of parallel algorithms on three parallel processors: the CRAY X-MP, ELXSI and the HEP computers. Introduction and Background: We report the results of investigations of parallel algorithms for computational continuum dynamics. These programs (hydrocodes, wavecodes, etc.) produce simulations of the solutions to problems arising in the motion of continua: solid dynamics, liquid dynamics, gas dynamics, plasma dynamics, multiphase flow dynamics, thermal-hydraulic dynamics and multimaterial flow dynamics. This report restricts its scope to one-dimensional algorithms such as the von Neumann-Richtmyer (1950) scheme.

Liebrock, L.M.; McGrath, J.F.; Hicks, D.L.

1986-07-07

151

Quantitative analysis of parallel nanowire array assembly by dielectrophoresis.  

PubMed

We describe an assembly technique useful for generating ordered arrays of nanowires (NWs) between electrodes via dielectrophoresis (DEP) and an analysis technique useful for extracting quantitative information about the local electric fields and dielectrophoretic forces from video microscopy data. By tuning the magnitude of the applied electric fields such that the attractive forces on the NWs are of the same order of magnitude as the Brownian forces, and by taking advantage of the inter-NW repulsive forces during DEP, NWs can be assembled into parallel arrays with high reproducibility. By employing a particle-tracking code and analysis of NW motion, we demonstrate a method for quantitative mapping of the dielectrophoretic torques and NW-surface interactions as a function of position on the substrate, which allows a more complete understanding of the dynamics of the assembly and the ability to control these parameters for precise assembly. PMID:21161112

Papadakis, Stergios J; Hoffmann, Joan A; Deglau, David; Chen, Andrew; Tyagi, Pawan; Gracias, David H

2010-12-16

152

Realization of cantilever arrays for parallel proximity imaging  

NASA Astrophysics Data System (ADS)

This paper reports on the fabrication and characterisation of self-actuating and self-sensing cantilever arrays for large-scale parallel surface scanning. Each cantilever is integrated with a sharp silicon tip, a thermally driven bimorph actuator, and a piezoresistive deflection sensor. Thus, the tip-to-sample distance can be controlled individually for each cantilever. A tip radius below 10 nm is obtained, which enables nanometre in-plane surface imaging with Angstrom resolution in the vertical direction. The fabricated cantilever probe arrays are also applicable for large-area manipulation, sub-10 nm metrology, bottom-up synthesis, high-speed gas analysis, and different bio-applications such as recognition of DNA, RNA, or various biomarkers of a single disease.

Sarov, Y.; Ivanov, Tz; Frank, A.; Zöllner, J.-P.; Nikolov, N.; Rangelow, I. W.

2010-11-01

153

Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays  

Microsoft Academic Search

At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over μClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital

Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

2005-01-01

154

Parallel magnetic resonance imaging with localized arrays and Sinc interpolation (PILARS)  

Microsoft Academic Search

Large arrays with localized coil sensitivity make it possible to use parallel imaging to significantly accelerate MR imaging speed. However, the need for auto calibration signals limits the actual acceleration factors achievable with large arrays. This paper presents a novel method for parallel imaging with large arrays. The method uses Sinc kernels for k-space data interpolation that only requires one

Shuo Feng; Jim Ji

2011-01-01

155

Fast space-filling molecular graphics using dynamic partitioning among parallel processors.  

PubMed

We present a novel algorithm for the efficient generation of high-quality space-filling molecular graphics that is particularly appropriate for the creation of the large number of images needed in the animation of molecular dynamics. Each atom of the molecule is represented by a sphere of an appropriate radius, and the image of the sphere is constructed pixel-by-pixel using a generalization of the lighting model proposed by Porter (Comp. Graphics 1978, 12, 282). The edges of the spheres are antialiased, and intersections between spheres are handled through a simple blending algorithm that provides very smooth edges. We have implemented this algorithm on a multiprocessor computer using a procedure that dynamically repartitions the effort among the processors based on the CPU time used by each processor to create the previous image. This dynamic reallocation among processors automatically maximizes efficiency in the face of both the changing nature of the image from frame to frame and the shifting demands of the other programs running simultaneously on the same processors. We present data showing the efficiency of this multiprocessing algorithm as the number of processors is increased. The combination of the graphics and multiprocessor algorithms allows the fast generation of many high-quality images. PMID:1772836

Gertner, B J; Whitnell, R M; Wilson, K R

1991-09-01
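
The dynamic repartitioning described in the abstract above reallocates work for the next image from the CPU time each processor needed for the previous one. The abstract does not state the update rule, so the proportional rule below is an assumption made for illustration: each processor's observed throughput (share of work per unit time) determines its next share.

```python
def repartition(shares, times):
    """Compute new work shares from the previous frame.
    shares: fraction of the image each processor rendered last frame (sums to 1).
    times:  CPU time each processor needed for that share.
    New shares are proportional to observed throughput = share / time."""
    rates = [s / t for s, t in zip(shares, times)]
    total = sum(rates)
    return [r / total for r in rates]


if __name__ == "__main__":
    # Two processors split the frame evenly, but the second took three times as long
    # (harder region of the image, or competing jobs on that processor).
    print(repartition([0.5, 0.5], [1.0, 3.0]))  # [0.75, 0.25]
```

If throughput stays the same for the next frame, the new shares make both processors finish at the same time (1.5 time units in the example), and the rule adapts frame by frame when the image or the machine load changes.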

156

RADCAP: an operational parallel processing facility  

Microsoft Academic Search

An overview is presented of RADCAP, the operational associative array processor (AP) facility installed at Rome Air Development Center (RADC). Basically, this facility consists of a Goodyear Aerospace STARAN associative array (parallel) processor and various peripheral devices, all interfaced with a Honeywell Information Systems (HIS) 645 sequential computer, which runs under the Multics timeshared operating system. The RADCAP hardware and

James D. Feldman; Louis C. Fulmer

1974-01-01

157

A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array  

NASA Astrophysics Data System (ADS)

A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real-time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 µs time resolution of the gradient waveform, 1 µs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitization operate at the same 60 MHz clock, therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field.

Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

2013-05-01

158

A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array.  

PubMed

A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real-time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 μs time resolution of the gradient waveform, 1 μs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitalization operate at the same 60 MHz clock, therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field. PMID:23742570

Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

2013-05-01

159

A precision chirp scaling SAR processor extension to sub-aperture implementation on massively parallel supercomputers  

Microsoft Academic Search

A new concept in SAR raw data focusing algorithms is discussed. The so called “chirp scaling” (CS) technique has allowed the complete elimination of the interpolation step required in conventional ω-K wave domain processing algorithms. This drives to high performance the implementation of aberrationless 2D processors, both for SAR focusing and for analogue problems (seismic wave migration, tomography, etc.). Furthermore

Fabrizio Impagnatiello

1995-01-01

160

Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986  

SciTech Connect

The basic objective of this project was to consider a large class of matrix computations with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications such as the solution of partial differential equations by multigrid methods and in the fitting of geodetic data. Some of the methods developed have already found their use on some of the newly developed architectures.

Golub, G.H.

1987-04-30

161

A CMOS-array-computer with on-chip communication hardware developed for massively parallel applications  

Microsoft Academic Search

The authors present a scalable MIMD computer system which was designed to be used as neurocomputer. It is capable of emulating different types of neurons, including complex biologically motivated models based on activity pulses, variable pulse transmission times, and multiple threshold learning rules. It is constructed as an array consisting of nodal computer chips, each containing an on-chip communication processor

M. Schwarz; B. J. Hosticka; M. Kesper; P. Richert; M. Scholles

1991-01-01

162

Parallel processing architecture  

DOEpatents

The parallel processing architecture provides a processor array which accepts input data at a faster rate than its processing elements are able to execute. The main features of this architecture are its programmability, scalability, high bandwidth communication and low cost. It provides high connectivity while maintaining minimum distance between processor elements. This architecture enables construction of a parallel processing system with high bandwidth communication in six directions among the neighboring processors. It provides for future growth into more complex and optimized algorithms, and facilitates incorporation of hardware advances with little effect on currently installed systems. Parallel processing architecture is useful for data sharing in an array, pattern recognition within a data array and sustaining a data input rate which is higher than the pattern recognition algorithm execution time (particle identification in high energy physics).

Crosetto, D.B.

1992-01-01

163

Acousto-optic null-steering adaptive photonic processor architectures for phased arrays  

NASA Astrophysics Data System (ADS)

Two novel all-optical acousto-optic processor designs are introduced for antenna null steering applications. Both designs use an acousto-optic point modulator and a multi-channel acousto-optic deflector in a unique in-line arrangement to form a write/read two color system. One processor is a forward light flow optical design, while the other is a reversible light flow optical architecture. A write-only acousto-optic multichannel correlator processor design is also introduced using a counter-propagating signal correlator design. This processor also uses a time integrating detector such as a two dimensional charge coupled device or a high dynamic range photorefractive crystal for bias free correlation signal detection.

Riza, Nabeel A.

1996-06-01

164

High-performance computational chemistry: Hartree-Fock electronic structure calculations on massively parallel processors.  

SciTech Connect

The parallel performance of the NWChem version 1.2α parallel direct-SCF code has been characterized on five massively parallel supercomputers (IBM SP, Kendall Square KSR-2, CRAY T3D and T3E, and Intel Touchstone DELTA) using single-point energy calculations on seven molecules of varying size (up to 389 atoms) and composition (first-row atoms, halogens, and transition metals). The authors compare the performance using both replicated-data and distributed-data algorithms and the original McMurchie-Davidson and recently incorporated TEXAS integrals packages.

Tilson, J. L.; Minkoff, M.; Wagner, A. F.; Shepard, R.; Sutton, P.; Harrison, R. J.; Kendall, R. A.; Wong, A. T.; PNNL

1999-01-01

165

Space-charge-dominated beam dynamics simulations using the massively parallel processors (MPPs) of the Cray T3D  

SciTech Connect

Computer simulations using the multi-particle code PARMELA with a three-dimensional point-by-point space charge algorithm have turned out to be very helpful in supporting injector commissioning and operations at Thomas Jefferson National Accelerator Facility (Jefferson Lab, formerly called CEBAF). However, this algorithm, which defines a typical N² problem in CPU time scaling, is very time-consuming when N, the number of macro-particles, is large. Therefore, it is attractive to use massively parallel processors (MPPs) to speed up the simulations. Motivated by this, we modified the space charge subroutine for using the MPPs of the Cray T3D. The techniques used to parallelize and optimize the code on the T3D are discussed in this paper. The performance of the code on the T3D is examined in comparison with a Parallel Vector Processing supercomputer of the Cray C90 and an HP 735/125 high-end workstation.

Liu Hongxiu [Thomas Jefferson National Accelerator Facility 12000 Jefferson Avenue, Newport News, Virginia 23606 (United States)]

1997-02-01
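
As a rough illustration of why a point-by-point space charge calculation scales as N², the sketch below sums the field at each macro-particle from every other macro-particle. It is not PARMELA code: the function name `pairwise_space_charge`, the `softening` term, and the unit system (Coulomb constant omitted, no relativistic treatment) are all assumptions made for the example.

```python
import math


def pairwise_space_charge(positions, charge=1.0, softening=1e-9):
    """Sum the field at every particle from every other particle.

    Hypothetical helper in arbitrary units; the double loop is what
    makes the cost grow as N * (N - 1), i.e. on the order of N**2.
    """
    n = len(positions)
    fields = [[0.0, 0.0, 0.0] for _ in range(n)]
    pair_count = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no field on itself
            # Vector from source particle j to target particle i.
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                fields[i][k] += charge * d[k] * inv_r3
            pair_count += 1
    return fields, pair_count


two_fields, two_pairs = pairwise_space_charge([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
_, ten_pairs = pairwise_space_charge([(float(i), 0.0, 0.0) for i in range(10)])
```

Because each outer-loop iteration is independent of the others, the outer loop is the natural place to divide work among processors, which is one plausible reading of why the abstract calls an MPP attractive for this problem.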

166

Low-power, real-time digital video stabilization using the HyperX parallel processor  

NASA Astrophysics Data System (ADS)

Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air / ground vehicles that focuses on significantly reducing the size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full color video stream with a 53 ms frame latency using a single 100 DSP core HyperX hx3100™ processor running at less than 3 W power draw. By comparison an Intel Core2 Duo processor running the same base algorithm on a 320x240, 15 fps stream consumes on the order of 18 W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP based platform. In addition the implementation only requires a minimal number of components to interface directly to the imaging sensor and helmet mounted display or the same computing architecture can be used to generate software defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature based algorithm (11 x 11 Difference of Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera / platform motion on a frame-by-frame basis.

Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

2011-05-01
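
The "processing bandwidth increase times power improvement" figure can be checked approximately from the numbers quoted in the abstract. The short calculation below is a sketch only: it treats processing bandwidth as pixels per second and takes the stated power bounds (less than 3 W, on the order of 18 W) at face value, both of which are assumptions rather than the authors' stated method.

```python
def pixel_rate(width, height, fps):
    """Pixels processed per second for a given frame size and frame rate."""
    return width * height * fps


hyperx_rate = pixel_rate(800, 600, 30)   # HyperX stream from the abstract
gpp_rate = pixel_rate(320, 240, 15)      # Core2 Duo stream from the abstract

bandwidth_ratio = hyperx_rate / gpp_rate  # how many times more pixels per second
power_ratio = 18.0 / 3.0                  # assumed: 18 W versus an upper bound of 3 W
combined_ratio = bandwidth_ratio * power_ratio

print(bandwidth_ratio, power_ratio, combined_ratio)
```

With these inputs the product comes to 75, so the quoted 100x is plausible as an order-of-magnitude figure if the actual HyperX draw is somewhat below 3 W.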

167

Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays  

SciTech Connect

At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over μClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

2005-09-20

168

Retinal Parallel Processors: More than 100 Independent Microcircuits Operate within a Single Interneuron  

PubMed Central

SUMMARY Most neurons are highly polarized cells with branched dendrites that receive and integrate synaptic inputs and extensive axons that deliver action potential output to distant targets. By contrast, amacrine cells, a diverse class of inhibitory interneurons in the inner retina, collect input and distribute output within the same neuritic network. The extent to which most amacrine cells integrate synaptic information and distribute their output is poorly understood. Here, we show that single A17 amacrine cells provide reciprocal feedback inhibition to presynaptic bipolar cells via hundreds of independent microcircuits operating in parallel. The A17 uses specialized morphological features, biophysical properties, and synaptic mechanisms to isolate feedback microcircuits and maximize its capacity to handle many independent processes. This example of a neuron employing distributed parallel processing rather than spatial integration provides insights into how unconventional neuronal morphology and physiology can maximize network function while minimizing wiring cost.

Grimes, William N.; Zhang, Jun; Graydon, Cole W.; Kachar, Bechara; Diamond, Jeffrey S.

2010-01-01

169

Multiple curve presentation and zooming processor using Field Programmable Gate Arrays  

Microsoft Academic Search

This paper presents the design and implementation of a hardware graphical display custom processor for generating and manipulating plots based on a given set of time varying input signals. The paper primarily focuses on the design to generate plots of two sampled sine waves of 5 kHz and 10 kHz respectively, along with horizontal and vertical axes with proper scaling.

Dhushyanth Venkatesan; Omar Elkeelany

2011-01-01

170

PostProcessor Development of a Hybrid TRR-XY Parallel Kinematic Machine Tool  

Microsoft Academic Search

A hybrid 5-degrees-of-freedom parallel kinematic machine tool constructed using the TRR-XY mechanism has been used to investigate the theory of post-processing. The effects of the cutter shapes and machine construction on the post-processing are investigated. Some specific parameters only are required to modify the post-processing for the different tools used in real cutting. The tilt angle and yaw angle of

S.-L. Chen; T.-H. Chang; I. Inasaki; Y.-C. Liu

2002-01-01

171

A Study of the Phase and Filter Properties of Arrays of Parallel Conductors between Ground Planes  

Microsoft Academic Search

A number of structures are analyzed which consist of arrays of parallel conductors between ground planes or above a single ground plane. These include interdigital line, meander line, a form of helix,

J. T. Bolljahn; G. L. Matthaei

1962-01-01

172

Electrostatic quadrupole array for focusing parallel beams of charged particles  

DOEpatents

An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

Brodowski, John (Smithtown, NY)

1982-11-23

173

Arrays, non-determinism, side-effects, and parallelism: A functional perspective  

Microsoft Academic Search

Incremental, functional updates to arrays, executed in a non-deterministic manner, are shown to achieve the same effect (in both efficiency and functionality) as parallel assignment to imperative arrays. The strategy depends critically on the ability of a compiler to recognize not only that the incremental updates can be done destructively, but also that the updates may be done in any

Paul Hudak

1986-01-01

174

Resonator Fiber Optic Gyro with Bipolar Digital Serrodyne Scheme Using a Field-Programmable Gate Array-Based Digital Processor  

NASA Astrophysics Data System (ADS)

A field-programmable gate array-based digital processor is proposed and demonstrated experimentally for a resonator fiber optic gyro (R-FOG) with a bipolar digital serrodyne phase modulation scheme, which we previously proposed especially for R-FOG signal processing and its noise reduction. The processor has multiple functions. First, it suppresses both the fast- and slow-drift components in the difference between the laser frequency and the resonator's resonant frequency. The fast-drift with a small amplitude is compensated for by a proportional controller with an oversampling function to reduce the quantization error, while the slow-drift with a large amplitude is tracked using an up/down counter. Second, it automatically adjusts the amplitude of the waveform for bipolar digital serrodyne phase modulation for waves travelling both in the resonator clockwise and counterclockwise. Bipolar laser frequency alternation required to track the resonator's resonant frequency is ideally realized by adjusting the phase modulation amplitude. This automatic adjustment also realizes an additional function for reducing the gyro drift caused by backscattering in the fiber resonator, which was originally implemented in the shape of the waveform for bipolar digital serrodyne phase modulation. Third, the FPGA generates a gyro output with open-loop operation. The R-FOG performance is demonstrated to be improved by applying these three functions with the FPGA.

Wang, Xijing; He, Zuyuan; Hotate, Kazuo

2011-04-01

175

NEUSORT2.0: a multiple-channel neural signal processor with systolic array buffer and channel-interleaving processing schedule.  

PubMed

An emerging class of neuroprosthetic devices aims to provide aggressive performance by integrating more complicated signal processing hardware into the neural recording system with a large number of electrodes. However, the traditional parallel structure duplicating one neural signal processor (NSP) multiple times for multiple channels places a heavy burden on chip area. The serial structure sequentially switching the processing task between channels requires a bulky memory to store neural data and may have a long processing delay. In this paper, a memory hierarchy of systolic array buffer is proposed to support interleaved signal processing, channel by channel on a per-cycle basis, to match up with the data flow of the optimized multiple-channel frontend interface circuitry. The NSP can thus be tightly coupled to the analog frontend interface circuitry and perform signal processing for multiple channels in real time without any bulky memory. Based on our previous one-channel NSP of NEUSORT1.0 [1], the proposed memory hierarchy is realized on NEUSORT2.0 for a 16-channel neural recording system. Compared to 16 of NEUSORT1.0, NEUSORT2.0 demonstrates an 81.50% saving in terms of area×power factor. PMID:19163846

Chen, Tung-Chien; Yang, Zhi; Liu, Wentai; Chen, Liang-Gee

2008-01-01

176

New scalable systolic array processor architecture for simultaneous discrete convolution of k different (n × n) filter coefficient planes with a single image plane  

NASA Astrophysics Data System (ADS)

A new high-performance scalable systolic array processor architecture module is presented which can simultaneously convolute k different (n x n) Filter Coefficient (FC) planes with a single (i x j) pixel Input Image Plane (IP). The architecture will have the capability to simultaneously perform convolution of k different (n x n) FC planes on 600dpi (dot per inch) IPs of size 8 1/2 " x 11" at a rate such that k convoluted Output Image (OI) plane pixels are output each system clock cycle for a system clock cycle time of less than 10 nanoseconds. Bit-parallel arithmetic is used and each IP pixel is 8-bits in length and each FC plane coefficient is 6-bits in length. A new pipelined systolic type architecture module is first developed which can generate one convoluted OI plane pixel per system clock cycle using a level of 'r' hardware resources for the case of (n = 5). The architecture is then extended in a scalable and deeper pipelined manner to allow simultaneous convolution of a single IP pixel, with k different (n×n) FC planes for the case of (n = 5), within one system clock cycle, utilizing less than (k × r) hardware resources. Synthesis and post-implementation VHDL simulation results are shown for an experimental model of the architecture which validates the scalability and functionality of the architecture. Simulation results demonstrate the performance of the architecture to be directly proportional to pipeline depth.

Wong, Albert T.; Heath, J. R.; Lhamon, Michael E.

2003-05-01
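
The operation described above, convolving one image plane with k filter coefficient planes so that k output pixels are produced for each output position, can be sketched in software as follows. This is a plain sequential sketch, not the systolic hardware design: the function name `convolve_k_filters`, the "valid region only" border handling, and the page-time estimate (one output position per 10 ns clock, ignoring pipeline fill) are assumptions made for illustration.

```python
def convolve_k_filters(image, filters):
    """Convolve one image with k equally sized n x n filters (valid region only).

    Returns a list of k output planes. For each output position the same
    image window is reused for all k filters, mirroring the idea of
    producing k output pixels per step.
    """
    n = len(filters[0])
    rows, cols = len(image), len(image[0])
    out_rows, out_cols = rows - n + 1, cols - n + 1
    # True convolution flips each kernel in both axes; flatten once up front.
    flipped = [
        [f[n - 1 - i][n - 1 - j] for i in range(n) for j in range(n)]
        for f in filters
    ]
    outputs = [[[0] * out_cols for _ in range(out_rows)] for _ in filters]
    for r in range(out_rows):
        for c in range(out_cols):
            window = [image[r + i][c + j] for i in range(n) for j in range(n)]
            for index, coeffs in enumerate(flipped):
                outputs[index][r][c] = sum(w * x for w, x in zip(coeffs, window))
    return outputs


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
planes = convolve_k_filters(image, [[[1, 0], [0, 0]], [[0, 0], [0, 1]]])

# Rough page-time estimate for a 600 dpi, 8.5 in x 11 in image plane.
page_pixels = round(8.5 * 600) * (11 * 600)
seconds_per_page = page_pixels * 10e-9  # assumed: one output position per 10 ns
```

Under these assumptions a full page is about 33.7 million pixels, so a 10 ns cycle corresponds to roughly a third of a second per page regardless of k, which is the sense in which the k filter planes are processed simultaneously.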

177

Parallel sort on a linear array of cellular automata  

Microsoft Academic Search

A cellular automata machine (CA machine) is a structure of interconnected elementary automata, evolving in a parallel and synchronous way. In this paper, we analyse the CA Machine as a general computing structure in which specific computations on the input data must be done. We extend the standard definition of cellular automata to include some requirements of memory to store

J. L. Gordillo; J. V. Luna

1994-01-01
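
The abstract above is truncated and does not name the sorting rule used. As a generic illustration of sorting on a linear array where each cell only talks to its immediate neighbours, the sketch below uses odd-even transposition sort, a standard technique for such arrays; it is swapped in for illustration and should not be read as the authors' algorithm. The function name is hypothetical.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-exchange steps between neighbouring cells.

    In each step, disjoint neighbouring pairs are compared, so on real
    hardware all comparisons within a step could run at the same time.
    n steps are sufficient for n cells.
    """
    cells = list(values)
    n = len(cells)
    for step in range(n):
        start = step % 2  # even steps pair (0,1),(2,3)...; odd steps pair (1,2),(3,4)...
        for left in range(start, n - 1, 2):
            if cells[left] > cells[left + 1]:
                cells[left], cells[left + 1] = cells[left + 1], cells[left]
    return cells


sorted_cells = odd_even_transposition_sort([5, 1, 4, 2, 3])
```

The inner loop is written sequentially here, but because the pairs within one step never overlap, it models a synchronous update of the whole array.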

178

Pseudorandom Number Generator. Program-controlled Source of Three 15-bit Random-number Words per Microsecond for AP-120B Array Processors.  

National Technical Information Service (NTIS)

The objective of this project was to provide AP-120B array processors with a program-controlled source of 15-bit random-number words at the rate of three per microsecond. A simple TTL circuit was implemented to do this. The implementation and testing of a...

W. G. LaFond

1978-01-01

179

Accelerating Haskell array codes with multicore GPUs  

Microsoft Academic Search

Current GPUs are massively parallel multicore processors optimised for workloads with a large degree of SIMD parallelism. Good performance requires highly idiomatic programs, whose development is work intensive and requires expert knowledge. To raise the level of abstraction, we propose a domain-specific high-level language of array computations that captures appropriate idioms in the form of collective array operations. We embed

Manuel M. T. Chakravarty; Gabriele Keller; Sean Lee; Trevor L. McDonell; Vinod Grover

2011-01-01

180

Development of a GUI for parallel connected solar arrays  

Microsoft Academic Search

This work describes the development of a software package with a graphical user interface (GUI) for the evaluation of a solar array. The paper presents MATLAB-based simulations for various configurations of solar panels and compares these configurations on the basis of power output and photovoltaic characteristics, via current-voltage and power-voltage curves. The simulation used field data to consider partial shading of

Nisha Nagarajan; Jonathan W. Kimball

2011-01-01

181

Surface pressure survey in a parallel triangular tube array  

NASA Astrophysics Data System (ADS)

An experimental parametric study of the surface pressure on a cylinder in the sixth row of a rotated triangular tube array (P/d=1.375) with air cross-flow has been conducted. A range of static displacements were examined. Jet switching was observed in this array and resulted in large asymmetry in the pressure distribution around the static cylinder even in a geometrically symmetric configuration. The large fluctuations in lift force due to jet switching suggest that it should be more seriously considered when designing against failure. The effect of jet switching on the pressure distribution data was mitigated by deconstructing the pressure distribution into two modes. Forces were calculated from the pressure measurements. No simple parameterisation was found for either the lift or drag force, but it was found that the drag force was only weakly affected by the tube displacement. The data set presented here complements the data previously presented for normal triangular arrays and represents a valuable reference for validation of simulations and flow-induced vibration models.

Mahon, John; Meskell, Craig

2012-10-01

182

Microcavity LEDs coupled to POF arrays for parallel optical interconnects  

Microsoft Academic Search

A low cost approach for realising parallel optical interconnects, based on the use of micro-cavity LEDs and polymer optical fibres (POFs) has been proposed. LEDs were optimised for coupling to POF, yielding 3% external quantum efficiency into POF. Optimisation of the POF-termination procedure has led to interface losses of 0.3 dB. An alignment scheme, avoiding active alignment, has been discussed

B. Dhoedt; R. Baets; I. Moerman; P. Van Daele; P. Demeester; T. Coosemans; A. Van Hove; R. Bockstaele; C. Sys; L. Vanwassenhove

1998-01-01

183

NEUSORT2.0: A multiple-channel neural signal processor with systolic array buffer and channel-interleaving processing schedule  

Microsoft Academic Search

An emerging class of neuroprosthetic devices aims to provide aggressive performance by integrating more complicated signal processing hardware into the neural recording system with a large number of electrodes. However, the traditional parallel structure duplicating one neural signal processor (NSP) multiple times for multiple channels places a heavy burden on chip area. The serial structure sequentially switching the processing task

Tung-Chien Chen; Zhi Yang; Wentai Liu; Liang-Gee Chen

2008-01-01

184

A 32-Channel Lattice Transmission Line Array for Parallel Transmit and Receive MRI at 7 Tesla  

PubMed Central

Transmit and receive RF coil arrays have proven to be particularly beneficial for ultra-high-field MR. Transmit coil arrays enable such techniques as B1+ shimming to substantially improve transmit B1 homogeneity compared to conventional volume coil designs, and receive coil arrays offer enhanced parallel imaging performance and SNR. Concentric coil arrangements hold promise for developing transceiver arrays incorporating large numbers of coil elements. At magnetic field strengths of 7 tesla and higher where the Larmor frequencies of interest can exceed 300 MHz, the coil array design must also overcome the problem of the coil conductor length approaching the RF wavelength. In this study, a novel concentric arrangement of resonance elements built from capacitively-shortened half-wavelength transmission lines is presented. This approach was utilized to construct an array with whole-brain coverage using 16 transceiver elements and 16 receive-only elements, resulting in a coil with a total of 16 transmit and 32 receive channels.

Adriany, Gregor; Auerbach, Edward J.; Snyder, Carl J.; Gozubuyuk, Ark; Moeller, Steen; Ritter, Johannes; van de Moortele, Pierre-Francois; Vaughan, Tommy; Ugurbil, Kamil

2010-01-01

185

175 GMACS/mW Charge-Mode Adiabatic Mixed-Signal Array Processor  

Microsoft Academic Search

An adiabatic charge-recycling mixed-signal array with integrated resonant clock generator delivers 175 GMACS (multiply-and-accumulates per second) throughput for every mW of power, a ten-fold improvement over the dynamic power incurred when resonant line drivers are replaced with CMOS drivers. The 3-T CID/DRAM cell provides non-destructive 1b-1b multiply accumulation, and integrated quantizers yield 8-bit outputs with +/- 1 LSB worst-case mismatch.

Rafal Karakiewicz; Roman Genov; Adeel Abbas; Gert Cauwenberghs

2006-01-01

186

Front-end processor using BBD distributed delay-sum architecture for micromachined ultrasonic sensor array  

Microsoft Academic Search

Micromachined technology makes it possible to integrate the ultrasonic sensor with the front-end processing circuit together, and make an ultrasonic system smart, compact and low cost. An ultrasonic sensor array with resonant frequency of about 60 kHz is fabricated for airborne applications, using sol-gel derived Pb(Zr,Ti)O3 thin film and Si-based micromachining technique. The distributed delay-sum architecture based on a bucket

Yaowu Mo; Tsunehisa Tanaka; Koji Inoue; Kaoru Yamashita; Yoshihiko Suzuki

2003-01-01

187

Method for controlling propagation of data and transform through memory-linked wavefront array processor  

SciTech Connect

This patent describes a method for controlling propagation of data and transforms through a linear array of multiple processing elements interspersed with linking dual port memories where each dual port memory can be accessed simultaneously and without contention by processing elements located on its left and right and where each processing element can be locally controlled by at least one flow control flag corresponding to each particular dual port memory located adjacent thereto and the flow control flag is selectively controlled by processing elements to the right and left of the dual port memory.

Dolecek, Q.E.

1990-05-01

188

The Cesar computer architecture: A programmable array processor for space applications  

NASA Astrophysics Data System (ADS)

This paper describes the Cesar computer system in terms of architecture, programming environment, and applications. The architecture introduces a programmable hardware implementation approach to algorithms through high order library functions, callable from a Unix environment. Parallelism with no communication overhead is achieved, providing high performance in vector computations. The system has proven to be very effective for signal processing tasks as well as for classes of numerical applications. The system is compact and air-cooled, and the performance growth potential is considered very promising. The nature of the architecture makes the system interesting both as an application-specific supercomputer and as a custom configurable system for embedded applications.

Toverud, Morten; Våland, Per Atle; Skogstrøm, Roar

1993-08-01

189

CombinePlt and CombineThs user manual: Merging multiple, processor-local plot and time-history data bases produced during a parallel calculation. Revision 1  

SciTech Connect

The CombinePlt and CombineThs post-processing utilities are designed to merge the data in multiple, processor-local plot and time-history data bases produced by the parallel versions of the analysis codes DYNA3D, NIKE3D or PING into a serial database which is compatible with the existing versions of the GRIZ and THUG visualization tools. These utilities make use of the partition assignment file produced by the PartMesh suite of pre-processing utilities to map the data from the processor-local order to global order. These utilities are also capable of translating 64-bit IEEE data bases into 32-bit IEEE data bases which are required for post-processing with GRIZ or THUG on an SGI workstation.

Procassini, R.J.; DeGroot, A.J.

1995-09-21

190

Novel optical wavelength interleaver based on symmetrically parallel-coupled and apodized ring resonator arrays  

Microsoft Academic Search

Optical ring-resonators could be used to synthesize filters with low crosstalk and flat passbands. Their application to DWDM interleaving has been proposed and investigated previously. However, a number of important factors related to this topic have not yet been considered and appropriately addressed. In this paper, we propose a novel scheme of a symmetrically parallel-coupled ring resonator array with coupling

Christopher J. Kaalund; Zhe Jin; Wei Li; Gang-Ding Peng

2003-01-01

191

Generation of second optical harmonic in a macroscopic array of parallel nanowires  

NASA Astrophysics Data System (ADS)

The quadratic optical susceptibility tensor for a macroscopically ordered array of parallel nanowires is determined. An experimental investigation of the polarization properties of signals of the second harmonic from ferroelectric nanowires synthesized in channels of chrysotile asbestos demonstrate that its results can be useful in structural studies.

Belotitskii, V. I.; Kumzerov, Yu. A.; Fokin, A. V.

2009-09-01

192

Ultrafast laser parallel microprocessing using high uniformity binary Dammann grating generated beam array  

NASA Astrophysics Data System (ADS)

Ultrafast laser parallel processing using diffractive multi-beam patterns generated by a spatial light modulator (SLM) has demonstrated a great increase in processing throughput and efficiency. Applications ranging from surface thin film patterning to internal 3D refractive index modification have been recently reported with the parallel processing technology. Periodic and symmetrical geometry design (e.g. N × M beam array) of the multi-beam pattern must be avoided to guarantee the required high uniformity in these applications, which, however, limited the processing flexibility. In this paper, Dammann gratings are used to create diffractive 1 × 5 and 5 × 5 beam arrays for the parallel processing. The 0-th order, observed slightly stronger than the other higher orders, can be adjusted by superimposing a Fresnel zone lens (FZL) and tuning the degree of defocusing at the processing plane. The uniformity (presented by the variation of the machined hole diameter) is measured to be <4% after the adjustment. Additionally, a parallel surface patterning of indium tin oxide (ITO) thin film with periodic array structures was demonstrated using the Dammann grating generated beam array without requiring the complicated geometry separation and the time-consuming positioning.

Kuang, Zheng; Perrie, Walter; Liu, Dun; Edwardson, Stuart P.; Jiang, Yao; Fearon, Eamonn; Watkins, Ken G.; Dearden, Geoff

2013-05-01

193

A design space evaluation of grid processor architectures  

Microsoft Academic Search

In this paper, we survey the design space of a new class of architectures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each

Ramadass Nagarajan; Karthikeyan Sankaralingam; Doug Burger; Stephen W. Keckler

2001-01-01

194

Application of second generation advanced multi-media display processor (AMDP2) in a digital micro-mirror array based HDTV  

Microsoft Academic Search

A second generation of the advanced multi-media display processor (AMDP2) is applied in a consumer multi-media HDTV prototype system. The AMDP2 provides a cost effective and flexible platform which can be used to implement a wide array of video processing algorithms. Examples include interlace to progressive scan conversion, image scaling, and picture enhancement. This paper describes a digital micro mirror

David C. Hutchison; Kazuhiro Ohara; Akira Takeda

2001-01-01

195

Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1  

SciTech Connect

This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.

Werner, N.E.; Van Matre, S.W.

1985-05-01

196

Coding for parallel execution of hardware-in-the-loop millimeter-wave scene generation models on multicore SIMD processor architectures  

NASA Astrophysics Data System (ADS)

Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.

Olson, Richard F.

2013-05-01

197

10-channel fiber array fabrication technique for parallel optical coherence tomography system  

NASA Astrophysics Data System (ADS)

Optical Coherence Tomography (OCT) shows great promise for minimally intrusive biomedical imaging applications. A parallel OCT system is a novel technique that replaces mechanical transverse scanning with electronic scanning. This will reduce the time required to acquire image data. In this system an array of small diameter fibers is required to obtain an image in the transverse direction. Each fiber in the array is configured in an interferometer and is used to image one pixel in the transverse direction. In this paper we describe a technique to package 15 µm diameter fibers on a silicon-silica substrate to be used in a 2 mm endoscopic probe tip. Single mode fibers are etched to reduce the cladding diameter from 125 µm to 15 µm. Etched fibers are placed into a 4 mm by 150 µm trench in a silicon-silica substrate and secured with UV glue. Active alignment was used to simplify the layout of the fibers and minimize unwanted horizontal displacement of the fibers. A 10-channel fiber array was built, tested and later incorporated into a parallel optical coherence system. This paper describes the packaging, testing, and operation of the array in a parallel OCT system.

Arauz, Lina J.; Luo, Yuan; Castillo, Jose E.; Kostuk, Raymond K.; Barton, Jennifer

2007-03-01

198

Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit  

SciTech Connect

Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.

Daily, Jeffrey A.; Lewis, Robert R.

2011-11-30

199

Instruction systolic array in image processing applications  

NASA Astrophysics Data System (ADS)

The ISATEC parallel computer is the first implementation of an instruction systolic array for the commercial market. The goal of integration of 1024 processors on an add-on board for PCs has been achieved by the development of a low-power/low-area processor architecture whose instruction set is suited particularly for image processing applications. The paper introduces the concept of the instruction systolic array, its implementation and some application examples in the field of image processing.

Schimmler, Manfred; Lang, Hans-Werner

1996-08-01

200

Anisotropic charge and heat conduction through arrays of parallel elliptic cylinders in a continuous medium  

NASA Astrophysics Data System (ADS)

Arrays of circular pores in silicon can exhibit a phononic bandgap when the lattice constant is smaller than the phonon scattering length, and so have become of interest for use as thermoelectric materials, due to the large reduction in thermal conductivity that this bandgap can cause. The reduction in electrical conductivity is expected to be less, because the lattice constant of these arrays is engineered to be much larger than the electron scattering length. As a result, electron transport through the effective medium is well described by the diffusion equation, and the Seebeck coefficient is expected to increase. In this paper, we develop an expression for the purely diffusive thermal (or electrical) conductivity of a composite comprised of square or hexagonal arrays of parallel circular or elliptic cylinders of one material in a continuum of a second material. The transport parallel to the cylinders is straightforward, so we consider the transport in the two principal directions normal to the cylinders, using a self-consistent local field calculation based on the point dipole approximation. There are two limiting cases: large negative contrast (e.g., pores in a conductor) and large positive contrast (conducting pillars in air). In the large negative contrast case, the transport is only slightly affected parallel to the major axis of the elliptic cylinders but can be significantly affected parallel to the minor axis, even in the limit of zero volume fraction of pores. The positive contrast case is just the opposite: the transport is only slightly affected parallel to the minor axis of the pillars but can be significantly affected parallel to the major axis, even in the limit of zero volume fraction of pillars. The analytical results are compared to extensive FEA calculations obtained using Comsol™ and the agreement is generally very good, provided the cylinders are sufficiently small compared to the lattice constant.

Martin, James E.; Ribaudo, Troy

2013-04-01

201

OpenMP Parallelization of a Mickens Time-Integration Scheme for a Mixed-Culture Biofilm Model and Its Performance on Multi-core and Multi-processor Computers  

Microsoft Academic Search

We document and compare the performance of an OpenMP parallelized simulation code for a mixed-culture biofilm model on a desktop workstation with two quad core Xeon processors, and on SGI Altix Systems with single core and dual core Itanium processors. The underlying model is a parabolic system of highly non-linear partial differential equations, which is discretized in time using a

Nasim Muhammad; Hermann J. Eberl

2009-01-01

202

Microplate-compatible biamperometry array for parallel 48-channel amperometric or coulometric measurements.  

PubMed

We report a new reusable electrochemical array for parallel biamperometric measurements that has been designed for use with standard microplates. The 48-channel array uses half of the available 96 wells and has 48 pairs of Pt wire electrodes. Applications to the quantitation of a variety of oxidizable species, including acetaminophen, ascorbic acid, hydroquinone, trolox, and uric acid, are demonstrated in assays that use potassium ferricyanide as an oxidant to produce a mixture of ferri- and ferrocyanide. Hydrogen peroxide quantitation is also demonstrated, based on an assay in which ferrocyanide is oxidized, again to produce a mixture of ferri- and ferrocyanide. Detection limits (signal-to-noise ratio (S/N) = 3) in these assays range from 1 (acetaminophen, R2 = 0.994) to 8 microM (ascorbic acid, R2 = 0.967), and linearity was observed to analyte concentrations of at least 100 microM. We also demonstrate the application of the biamperometric array to enzymatic assays, using the glucose oxidase reaction as an example; following a 20 min enzyme reaction time, a detection limit of 0.1 mM glucose was obtained. These results indicate that applications to other oxidase-based assays are feasible in this high-throughput format. The new electrochemical array employs standard, inexpensive microplates, and the biamperometric measurements are simple, precise, and rapid, requiring only 2 min for 48 parallel measurements. PMID:18341302

Mann, Thomas S; O'Hagan, Liam; Ertl, Peter; Sparkes, Douglas I; Mikkelsen, Susan R

2008-03-15

203

Compiling Fortran 8x array features for the connection machine computer system  

Microsoft Academic Search

The Connection Machine® computer system supports a data parallel programming style, making it a natural target architecture for Fortran 8x array constructs. The Connection Machine Fortran compiler generates VAX code that performs scalar operations and directs the Connection Machine to perform array operations. The Connection Machine virtual processor mechanism supports elemental operations on very large arrays. Most array operators and

Eugene Albert; Kathleen Knobe; Joan D. Lukazt; Guy L. Steele Jr.

1988-01-01

204

Weak-Periodic Stochastic Resonance in a Parallel Array of Static Nonlinearities  

PubMed Central

This paper studies the output-input signal-to-noise ratio (SNR) gain of an uncoupled parallel array of static, yet arbitrary, nonlinear elements for transmitting a weak periodic signal in additive white noise. In the small-signal limit, an explicit expression for the SNR gain is derived. It serves to prove that the SNR gain is always a monotonically increasing function of the array size for any given nonlinearity and noisy environment. It also determines the SNR gain maximized by the locally optimal nonlinearity as the upper bound of the SNR gain achieved by an array of static nonlinear elements. With locally optimal nonlinearity, it is demonstrated that stochastic resonance cannot occur, i.e. adding internal noise into the array never improves the SNR gain. However, in an array of suboptimal but easily implemented threshold nonlinearities, we show the feasibility of situations where stochastic resonance occurs, and also the possibility of the SNR gain exceeding unity for a wide range of input noise distributions.

Ma, Yumei; Duan, Fabing; Chapeau-Blondeau, Francois; Abbott, Derek

2013-01-01

205

Architecture Studies and System Demonstrations of Optical Parallel Processor for AI (Artificial Intelligence) and NI (Neural Intelligence).  

National Technical Information Service (NTIS)

During the last six months we have applied the results of our studies on existing parallel computing architectures for AI and NI to develop the Programmable OptoElectronic Multiprocessor (POEM) architecture. Our goal was to design a scalable architecture sui...

S. H. Lee

1988-01-01

206

Numerical Study of a Crossed Loop Coil Array for Parallel Magnetic Resonance Imaging  

SciTech Connect

A coil design has been recently proposed by Temnikov (Instrum Exp Tech. 2005;48:636-637), with higher experimental signal-to-noise ratio than that of the birdcage coil. It is also claimed that it is possible to individually tune it with a single chip capacitor. This coil design closely resembles the gradiometer coil. These results motivated us to numerically simulate a three-coil array for parallel magnetic resonance imaging and in vivo magnetic resonance spectroscopy with multinuclear capability. The magnetic field was numerically simulated by solving Maxwell's equations with the finite element method. Uniformity profiles were calculated at the midsection for a single coil and showed good agreement with the experimental data. Then, two more coils were added to form two different coil arrays: coil elements were equally spaced by an angle of 30 deg. Then, uniformity profiles were calculated again for all cases at the midsection. Despite the strong interaction among all coil elements, very good field uniformity can be achieved. These numerical results indicate that this coil array may be a good choice for parallel magnetic resonance imaging.

Hernandez, J.; Solis, S. E.; Rodriguez, A. O. [Centro de Investigacion e Instrumentacion e Imagenoloia Medica, Universidad Autonoma Metropolitana Iztapalapa, Mexico DF 09340 (Mexico)

2008-08-11

207

Numerical Study of a Crossed Loop Coil Array for Parallel Magnetic Resonance Imaging  

NASA Astrophysics Data System (ADS)

A coil design has been recently proposed by Temnikov (Instrum Exp Tech. 2005;48:636-637), with higher experimental signal-to-noise ratio than that of the birdcage coil. It is also claimed that it is possible to individually tune it with a single chip capacitor. This coil design closely resembles the gradiometer coil. These results motivated us to numerically simulate a three-coil array for parallel magnetic resonance imaging and in vivo magnetic resonance spectroscopy with multinuclear capability. The magnetic field was numerically simulated by solving Maxwell's equations with the finite element method. Uniformity profiles were calculated at the midsection for a single coil and showed good agreement with the experimental data. Then, two more coils were added to form two different coil arrays: coil elements were equally spaced by an angle of 30°. Then, uniformity profiles were calculated again for all cases at the midsection. Despite the strong interaction among all coil elements, very good field uniformity can be achieved. These numerical results indicate that this coil array may be a good choice for parallel magnetic resonance imaging.

Hernández, J.; Solis, S. E.; Rodriguez, A. O.

2008-08-01

208

Ultra-Wideband Tapered Slot Antenna Arrays with Parallel-Plate Waveguides  

NASA Astrophysics Data System (ADS)

Owing to their ultra-wideband characteristics, tapered slot antennas (TSAs) are used as element antennas in wideband phased arrays. However, when the size of a TSA is reduced in order to prevent the generation of a grating lobe during wide-angle beam scanning, the original ultra-wideband characteristics are degraded because of increased reflections from the ends of the tapered slot aperture. To overcome this difficulty, we propose a new antenna structure in which parallel-plate waveguides are added to the TSA. The advantage of this new structure is that the reflection characteristics of individual antenna elements are not degraded even if the width of the antenna aperture is very small, i.e., approximately one-half the wavelength of the highest operating frequency. In this study, we propose a procedure for designing the new antenna through numerical simulations by using the FDTD method. In addition, we verify the performance of the antenna array by experiments.

Yamaguchi, Satoshi; Miyashita, Hiroaki; Takahashi, Toru; Otsuka, Masataka; Konishi, Yoshihiko

209

Design and implementation of a parallel array operator for the arbitrary remapping of data.  

SciTech Connect

The data redistribution or remapping functions, gather and scatter, are long-standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations (run-length encoding, dead array reuse, and direct communication) that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate comparable performance to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.

Dietz, Steven; Choi, S. E. (Sung-Eun); Chamberlain, B. L. (Bradford L.); Snyder, Lawrence

2003-01-01
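
The gather and scatter remapping functions named in the abstract above can be illustrated with a minimal serial sketch. This is not the ZPL operator from the paper and says nothing about its parallel optimizations; the function names `gather` and `scatter`, the `fill` parameter, and the last-write-wins rule on index collisions are assumptions chosen for illustration.

```python
def gather(src, index):
    """Gather: build dst so that dst[i] = src[index[i]]."""
    return [src[j] for j in index]


def scatter(src, index, size, fill=0):
    """Scatter: build dst of length `size` so that dst[index[i]] = src[i].

    Positions never written keep `fill`. If two entries of `index`
    collide, the later write wins (an assumption of this sketch).
    """
    dst = [fill] * size
    for value, j in zip(src, index):
        dst[j] = value
    return dst


data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]

gathered = gather(data, perm)                   # reorder data by perm
restored = scatter(gathered, perm, len(data))   # undo the reordering
```

When `index` is a permutation, scattering with the same index inverts the gather, which is why `restored` equals `data` here.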

210

A survey of processors with explicit multithreading  

Microsoft Academic Search

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already in production in the areas of high-performance microprocessors, media, and network processors. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads

Theo Ungerer; Borut Robič; Jurij Šilc

2003-01-01

211

Compiler for an array and vector processing language  

SciTech Connect

A compiler for a Pascal-based language Actus is described. The language is suitable for the expression of the type of parallelism offered by both array and vector processors. The implementation described is for the Cray-1 computer. An objective of the implementation has been to construct an optimizing compiler which can be readily adapted for a range of array and vector processors. As a result the machine-dependent sections of the compiler have been clearly identified. 9 references.

Perrott, R.H.; Crookes, D.; Milligan, P.; Purdy, W.R.M.

1985-05-01

212

Three dimensional flow processor  

DOEpatents

The 3D-flow processor is a general purpose programmable data stream pipelined device that allows fast data movement in six directions for digital signal processing applications such as identifying objects in a matrix in a programmable form. The 3D-flow processor can be used in one dimensional, two dimensional, and three dimensional topologies capable of sustaining an input data rate of up to 100 million data (or frames) per second in a parallel processing system.

Crosetto, D.B.

1992-01-01

213

Silicon-substrate microelectrode arrays for parallel recording of neural activity in peripheral and cranial nerves.  

PubMed

A new process for the fabrication of regeneration microelectrode arrays for peripheral and cranial nerve applications is presented. This type of array is implanted between the severed ends of nerves, the axons of which regenerate through via holes in the silicon and are thereafter held fixed with respect to the microelectrodes. The process described is designed for compatibility with industry-standard CMOS or BiCMOS processes (it does not involve high-temperature process steps nor heavily-doped etch-stop layers), and provides a thin membrane for the via holes, surrounded by a thick silicon supporting rim. Many basic questions remain regarding the optimum via hole and microelectrode geometries in terms of both biological and electrical performance of the implants, and therefore passive versions were fabricated as tools for addressing these issues in on-going work. Versions of the devices were implanted in the rat peroneal nerve and in the frog auditory nerve. In both cases, regeneration was verified histologically and it was observed that the regenerated nerves had reorganized into microfascicles containing both myelinated and unmyelinated axons and corresponding to the grid pattern of the via holes. These microelectrode arrays were shown to allow the recording of action potential signals in both the peripheral and cranial nerve setting, from several microelectrodes in parallel. PMID:7927376

Kovacs, G T; Storment, C W; Halks-Miller, M; Belczynski, C R; Della Santina, C C; Lewis, E R; Maluf, N I

1994-06-01

214

High acceleration with a rotating radiofrequency coil array (RRFCA) in parallel magnetic resonance imaging (MRI).  

PubMed

This study explores the performance of a novel hybrid technology, in which the recently introduced rotating RF coil (RRFC) was combined with the principles of Parallel Imaging (PI) to improve the quality and speed of magnetic resonance (MR) images. To evaluate the system, a low-density naturally-decoupled 4-channel rotating radiofrequency coil array (RRFCA) was modelled and investigated. The traditional SENSitivity Encoding (SENSE) reconstruction method and the means of calculating the geometry factor distribution (g map) were adapted to take into account the transient sensitivity encoding. Simulations at 3 T showed that continuous rotating motion considerably enhanced the coil sensitivity encoding capability, making higher reduction factors in scan time possible. The sensitivity encoding capability can be further improved by choosing an optimal speed of array rotation. Compared to traditional phased-array coils (PACs) with twice as many coil elements, the RRFCA demonstrated clear advantages in terms of quality of reconstruction and superior noise behaviour in all the cases investigated in this initial study. PMID:23366087

Li, Mingyan; Jin, Jin; Trakic, Adnan; Liu, Feng; Weber, Ewald; Li, Yu; Crozier, Stuart

2012-01-01

215

Optoelectronic parallel processing with smart pixel arrays for automated screening of cervical smear imagery  

NASA Astrophysics Data System (ADS)

This thesis investigates the use of optoelectronic parallel processing systems with smart photosensor arrays (SPAs) to examine cervical smear images. The automation of cervical smear screening seeks to reduce human workload and improve the accuracy of detecting pre-cancerous and cancerous conditions. Increasing the parallelism of image processing improves the speed and accuracy of locating regions-of-interest (ROI) from images of the cervical smear for the first stage of a two-stage screening system. The two-stage approach first detects ROI optoelectronically before classifying them using more time consuming electronic algorithms. The optoelectronic hit/miss transform (HMT) is computed using gray scale modulation spatial light modulators in an optical correlator. To further the parallelism of this system, a novel CMOS SPA computes the post processing steps required by the HMT algorithm. The SPA reduces the subsequent bandwidth passed into the second, electronic image processing stage classifying the detected ROI. Limitations in the miss operation of the HMT suggest using only the hit operation for detecting ROI. This makes possible a single SPA chip approach using only the hit operation for ROI detection which may replace the optoelectronic correlator in the screening system. Both the HMT SPA postprocessor and the SPA ROI detector design provide compact, efficient, and low-cost optoelectronic solutions to performing ROI detection on cervical smears. Analysis of optoelectronic ROI detection with electronic ROI classification shows these systems have the potential to perform at, or above, the current error rates for manual classification of cervical smears.

Metz, John Langdon

2000-10-01
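
The hit/miss transform (HMT) used for ROI detection in the record above can be sketched in software using the textbook binary definition. This is only an illustration: the thesis computes the HMT optoelectronically with gray scale modulation in an optical correlator, which this sketch does not model. The function name `hit_miss`, the square odd-sized templates, and the rule that pixels outside the image count as background are all assumptions.

```python
def hit_miss(image, hit, miss):
    """Binary hit-or-miss transform (textbook form, not the optical one).

    The output is 1 at (r, c) when, with both templates centered there,
    every 1 in `hit` lies on a foreground pixel and every 1 in `miss`
    lies on a background pixel. Templates are square with odd size;
    pixels outside the image are treated as background (assumption).
    """
    rows, cols = len(image), len(image[0])
    k = len(hit) // 2  # template half-width

    def pixel(r, c):
        inside = 0 <= r < rows and 0 <= c < cols
        return image[r][c] if inside else 0

    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            match = True
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    p = pixel(r + dr, c + dc)
                    if hit[dr + k][dc + k] and p != 1:
                        match = False   # "hit" condition violated
                    if miss[dr + k][dc + k] and p != 0:
                        match = False   # "miss" condition violated
            out[r][c] = 1 if match else 0
    return out


# Templates that detect isolated foreground pixels:
# the center must be set (hit) and all 8 neighbors clear (miss).
hit = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
miss = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]

image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
detected = hit_miss(image, hit, miss)
```

Dropping the `miss` template (setting it to all zeros) reduces the test to the hit operation alone, which is the simplification the abstract suggests for ROI detection.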

216

Real-time processor for staring receivers  

NASA Astrophysics Data System (ADS)

The design, fabrication, and testing of a state-of-the-art, high-throughput on-focal plane IR-image signal processor is described. The processing functions performed are frame differencing and thresholding. The final focal plane array will consist of a 128 x 128-pixel platinum-silicide detector bump-mounted to an on-chip CCD multiplexer. The processor is in a 128-channel parallel-pipeline format. Each channel consists of a pixel regenerator (charge differencer), 128-pixel frame store CCD memory, pixel differencer, second pixel regenerator, thresholder (analog comparator), and digital latch. Four parallel analog outputs and four parallel digital outputs are included. The digital outputs provide a bit map of the image. All analog clock signals (128 KHz, 256 KHz, and 5 MHz) are generated by on-chip TTL-input clock drivers. TTL clock driver inputs are generated off-chip. The technology is low-temperature surface and buried channel CCD/CMOS/indium bump. The design goal was 8-bit resolution at 77 K and 1000 frames/s. Applications include point- or extended-target motion detection with thresholding. Design trade-offs and enhancements (such as on-chip detector gain compensation and a simple window processor) are discussed.

Hanzal, Brian; Peczalski, Andrzej; Schwanebeck, James; Sanderson, Richard; Fossum, Eric

1992-07-01
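
The two processing functions named in the record above, frame differencing and thresholding, can be sketched in software as follows. The real processor does this in analog CCD circuitry with one pipeline per channel; here frames are plain nested lists, and the function name, the use of the absolute difference, and the strict `>` comparison are assumptions made for illustration.

```python
def frame_difference_threshold(prev_frame, curr_frame, threshold):
    """Return a bit map of the pixels that changed between two frames.

    A pixel maps to 1 when |current - previous| exceeds `threshold`,
    otherwise to 0. Both frames must have the same dimensions.
    """
    return [
        [1 if abs(c - p) > threshold else 0
         for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


prev_frame = [
    [10, 10, 10],
    [10, 10, 10],
]
curr_frame = [
    [10, 60, 12],   # one large change and one small change
    [10, 10, 90],   # one large change
]
bitmap = frame_difference_threshold(prev_frame, curr_frame, threshold=20)
```

Only the two large changes survive the threshold, so the bit map marks moving targets while suppressing small pixel fluctuations.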

217

An adaptive multimicroprocessor array computing structure for radar signal processing applications  

Microsoft Academic Search

This paper describes an array processor designed for signal processing in radar applications. The processor consists of a large number of microprocessor-based processing elements and is designed to be adaptive in real-time processing requirements. The processing problem has been considered to have a quite specific data organization and data rate which can be exploited in the architectural design. Parallel processing

C. V. W. Armstrong; H. M. Ahmed; N. A. Brans; E. Fathi

1979-01-01

218

Polycyclic processor  

SciTech Connect

A polycyclic processor concept has been developed to correct some of the problems intrinsic in the present horizontally microcoded processors. The key was the development of the essential VLSI interconnect circuit, which makes it possible to build a polycyclic processor. It is claimed that industrial exploitation of the concept will rapidly expand this market. It is shown how the esl polycyclic processors will reduce system development cost and time while reducing the need for microprogramming specialists. 4 references.

Chatterjee, B.G.

1983-01-01

219

Subwavelength microwave imaging using an array of parallel conducting wires as a lens  

NASA Astrophysics Data System (ADS)

An original realization of a lens capable of transmitting images with subwavelength resolution is proposed. The lens is formed by an array of parallel conducting wires and effectively operates as a telegraph which captures a distribution of the electric field at the front interface of the lens and transmits it to the back side without distortions. This regime of operation is called canalization and is inherent in flat lenses formed by electromagnetic crystals. The theoretical estimations are supported by numerical simulations and experimental verification. The subwavelength resolution of λ/15 and 18% bandwidth of operation are demonstrated at gigahertz frequencies. The proposed lens is capable of transporting subwavelength images without distortion to nearly unlimited distances since the influence of losses to the lens operation is negligibly small.

Belov, Pavel A.; Hao, Yang; Sudhakaran, Sunil

2006-01-01

220

[Expansion of sensitivity area for magnetic resonance imaging of the hand using parallel-array coil].  

PubMed

It is difficult for rheumatoid arthritis (RA) patients to hold a strenuous position for a long time during examinations. A field of view (FOV) of 250 mm is needed to examine the hand from the wrist to the fingers. Two-channel phased-array coils are effective for off-center examinations of the upper and lower extremities. The sensitivity area of the array coils can be expanded by shifting the two coil elements in opposite directions by 40-60% of the element diameter. The method is supported by the increased signal-to-noise ratio (SNR) in the peripheral regions (in the shifted directions), at the cost of a decreased SNR in the central area. This was confirmed by visual assessment of hand images, including images acquired with the fat suppression technique. The expansion of the sensitivity area was verified using Scheffe's method of paired comparison (Ura's modified method). The method can be expected to apply to other regions of the body when parallel-positioned coils are used in clinical situations. PMID:23089837

Takatsu, Yasuo; Yamamura, Kenichirou; Miyati, Tosiaki; Kimura, Tetsuya; Ueyama, Tsuyoshi; Ishikuro, Akihiro

2012-01-01

221

Femtosecond laser fabrication of micro/nano-channel array devices for parallelized fluorescence detection  

NASA Astrophysics Data System (ADS)

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. Ultrasensitive, highly parallelized fluorescence-based platforms that incorporate a nano/micro-fluidic chip with an array of closely spaced channels would meet this need. We discuss the use of direct femtosecond laser machining to fabricate prototype fluidic chips with arrays of more than one hundred closely spaced channels. Traditional machining techniques involve overlapping focal spots from many laser pulses while scanning the substrate in order to create channels. However, this procedure is not only lengthy but may allow thermal effects to accumulate that degrade the quality of both the channel profile and surrounding substrate material. We are developing a different method for machining a line with just a single pulse, using a combination of cylindrical lenses and an aspheric lens to reshape a near-Gaussian beam into a tight line focus. Channels on the order of 1 micron wide, 5 microns deep, and nearly 2000 microns long may be made this way. We also address the critical issue of mitigating the high autofluorescence responses that arise from the creation of defects by fs-laser machining in fused silica.

Canfield, Brian; Hofmeister, William; Davis, Lloyd

2013-03-01

222

Multi-processor performance on the Tera MTA  

Microsoft Academic Search

The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and keep the processor saturated. If the computation is sufficiently large,

Allan Snavely; Larry Carter; Jay Boisseau; Amit Majumdar; Kang Su Gatlin; Nick Mitchell; John Feo; Brian Koblenz

1998-01-01

223

Multiple Instruction Stream Processor  

Microsoft Academic Search

Microprocessor design is undergoing a major paradigm shift towards multi-core designs, in anticipation that future performance gains will come from exploiting thread-level parallelism in the software. To support this trend, we present a novel processor architecture called the Multiple Instruction Stream Processing (MISP) architecture. MISP introduces the sequencer as a new category of architectural resource, and defines a

Richard A. Hankins; Gautham N. Chinya; Jamison D. Collins; Perry H. Wang; Ryan Rakvic; Hong Wang; John Paul Shen

2006-01-01

224

12-channel parallel optical-fiber transmission using a low-drive current 1.3-µm LED array and a p-i-n PD array  

Microsoft Academic Search

Twelve-channel 14-Mb/s/channel 1-km parallel optical-fiber transmission using a 1×12 low-drive-current 1.3-µm light-emitting diode (LED) linear array and an InGaAs p-i-n photodiode linear array, with the LED drive current as low as 12 mAp-p/channel, is discussed. No receiver sensitivity degradation has been observed under simultaneous 12-channel operation. The skew was less than 6 ns after transmission through a 1-km-long 12-channel optical-fiber

Kazuhisa Kaede; Toshio Uji; Takeshi Nagahori; Tetsuyuki Suzaki; Toshitaka Torikai; Junji Hayashi; Isao Watanabe; Masataka Itoh; Hiroshi Honmou; Minoru Shikada

1990-01-01

225

Parallel quicksort  

SciTech Connect

This paper reports on the development of a parallel version of quicksort on a CRCW PRAM. The algorithm uses n processors and linear space to sort n keys in expected time O(log n) with high probability.

Vrto, I. (Inst. of Technical Cybernetics, Slovak Academy of Sciences, Dubravska Cesta 9, 842-37 Bratislava (CS)); Chlebus, B.S. (Dept. of Computer Science, Univ. of California, Riverside, CA (US))

1991-04-01
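
The idea of sorting n keys with n processors in an expected logarithmic number of rounds can be illustrated with a small sequential simulation. The sketch below uses a well-known tree-building formulation of quicksort for a CRCW PRAM; it is an assumption that this resembles the algorithm in the record above, since the abstract gives no details. The function name `crcw_quicksort` and the use of a random winner to model an arbitrary concurrent write are illustrative choices only.

```python
import random

ROOT = ("root", None)  # the slot every key contends for in the first round


def crcw_quicksort(keys, seed=0):
    """Simulate tree-building quicksort with one virtual processor per key.

    In each round, all unplaced processors concurrently write their index
    into the tree slot they are contending for; one arbitrary writer wins
    and becomes the node stored in that slot. Each loser compares its key
    with the winner and moves on to the winner's left or right child slot.
    Returns (sorted_keys, number_of_rounds).
    """
    rng = random.Random(seed)
    n = len(keys)
    if n == 0:
        return [], 0
    left = [None] * n
    right = [None] * n
    root = None
    target = {i: ROOT for i in range(n)}  # slot each processor contends for
    rounds = 0
    while target:
        rounds += 1
        contenders = {}
        for i, slot in target.items():
            contenders.setdefault(slot, []).append(i)
        # Model an arbitrary-CRCW write: one contender per slot succeeds.
        winners = {slot: rng.choice(group) for slot, group in contenders.items()}
        for slot, w in winners.items():
            if slot == ROOT:
                root = w
            elif slot[1] == "L":
                left[slot[0]] = w
            else:
                right[slot[0]] = w
        next_target = {}
        for i, slot in target.items():
            w = winners[slot]
            if i != w:
                # Ties between equal keys are broken by index.
                side = "L" if (keys[i], i) < (keys[w], w) else "R"
                next_target[i] = (w, side)
        target = next_target
    # An in-order traversal of the finished tree yields the sorted order.
    out, stack, node = [], [], root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = left[node]
        node = stack.pop()
        out.append(keys[node])
        node = right[node]
    return out, rounds
```

The number of rounds equals the depth of the random binary search tree that is built, which is why the expected parallel time is logarithmic in n while the simulation itself remains sequential.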

226

Highly parallel introduction of nucleic acids into mammalian cells grown in microwell arrays.  

PubMed

High-throughput cell-based screens of genome-size collections of cDNAs and siRNAs have become a powerful tool to annotate the mammalian genome, enabling the discovery of novel genes associated with normal cellular processes and pathogenic states, and the unravelling of genetic networks and signaling pathways in a systems biology approach. However, the capital expenses and the cost of reagents necessary to perform such large screens have limited application of this technology. Efforts to miniaturize the screening process have centered on the development of cellular microarrays created on microscope slides that use chemical means to introduce exogenous genetic material into mammalian cells. While this work has demonstrated the feasibility of screening in very small formats, the use of chemical transfection reagents (effective only in a subset of cell lines and not on primary cells) and the lack of defined borders between cells grown in adjacent microspots containing different genetic material (to prevent cell migration and to aid spot location recognition during imaging and phenotype deconvolution) have hampered the spread of this screening technology. Here, we describe proof-of-principles experiments to circumvent these drawbacks. We have created microwell arrays on an electroporation-ready transparent substrate and established procedures to achieve highly efficient parallel introduction of exogenous molecules into human cell lines and primary mouse macrophages. The microwells confine cells and offer multiple advantages during imaging and phenotype analysis. We have also developed a simple method to load this 484-microwell array with libraries of nucleic acids using a standard microarrayer. These advances can be elaborated upon to form the basis of a miniaturized high-throughput functional genomics screening platform to carry out genome-size screens in a variety of mammalian cells that may eventually become a mainstream tool for life science research. PMID:20024036

Jain, Tilak; McBride, Ryan; Head, Steven; Saez, Enrique

2009-10-13

227

A proposed scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC detectors  

Microsoft Academic Search

A data acquisition system architecture which draws heavily from the communications industry is proposed. The architecture is totally parallel (i.e. without any bottlenecks), capable of data rates of hundreds of gigabytes per second from the detector and into an array of online processors (i.e. processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online

E. Barsotti; A. Booth; M. Bowden; C. Swoboda; N. Lockyer; R. Vanberg

1990-01-01

228

Parallel Information Processing.  

ERIC Educational Resources Information Center

Examines parallel computer architecture and the use of parallel processors for text. Topics discussed include parallel algorithms; performance evaluation; parallel information processing; parallel access methods for text; parallel and distributed information retrieval systems; parallel hardware for text; and network models for information…

Rasmussen, Edie M.

1992-01-01

229

Optimizing Compiler for the CELL Processor  

Microsoft Academic Search

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first generation CELL processor implements on a single chip a Power Architecture processor with two

Alexandre E. Eichenberger; Kathryn M. O'Brien; Kevin O'Brien; Peng Wu; Tong Chen; Peter H. Oden; Daniel A. Prener; Janice C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; Peng Zhao; Michael Gschwind

2005-01-01

230

Scalable Programming Models for Massively Multicore Processors  

Microsoft Academic Search

Including multiple cores on a single chip has become the dominant mechanism for scaling processor performance. Exponential growth in the number of cores on a single processor is expected to lead in a short time to mainstream computers with hundreds of cores. Scalable implementations of parallel algorithms will be necessary in order to achieve improved single-application performance on such processors.

Michael D. McCool

2008-01-01

231

480-GMACS/mW Resonant Adiabatic Mixed-Signal Processor Array for Charge-Based Pattern Recognition  

Microsoft Academic Search

A resonant adiabatic mixed-signal VLSI array delivers 480 GMACS (10^9 multiply-and-accumulates per second) throughput for every mW of power, a 25-fold improvement over the energy efficiency obtained when resonant clock generator and line drivers are replaced with static CMOS drivers. Losses in resonant clock generation are minimized by activating switches between the LC tank and DC supply with a periodic

Rafal Karakiewicz; Roman Genov; Gert Cauwenberghs

2007-01-01

232

An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C  

SciTech Connect

Co-array Fortran (CAF) and Unified Parallel C (UPC) are two emerging languages for single-program, multiple-data global address space programming. These languages boost programmer productivity by providing shared variables for communication instead of message passing. However, the performance of these emerging languages still has room for improvement. In this paper, we study the performance of variants of the NAS MG, CG, SP, and BT benchmarks on several modern cluster architectures to identify challenges that must be met to deliver top performance. We compare CAF and UPC variants of these programs with the original Fortran+MPI code. Today, CAF and UPC programs deliver scalable performance on clusters only when written to use bulk communication. However, our experiments uncovered some significant performance bottlenecks limiting UPC performance on all platforms. We account for the root causes of these performance anomalies and show that they can be remedied with additional compiler improvements; in particular, we show that many of these obstacles can be resolved with adequate optimizations by the backend C compilers.

Coarfa, Cristian; Dotsenko, Yuri; Mellor-Crummey, John M.; Cantonnet, François; El-Ghazawi, Tarek; Mohanti, Ashrujit; Yao, Yiyi; Chavarría-Miranda, Daniel

2005-06-10

233

The Milstar Advanced Processor  

NASA Astrophysics Data System (ADS)

The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

234

A preliminary architecture for a basic data-flow processor  

Microsoft Academic Search

A processor is described which can achieve highly parallel execution of programs represented in data-flow form. The language implemented incorporates conditional and iteration mechanisms, and the processor is a step toward a practical data-flow processor for a Fortran-level data-flow language. The processor has a unique architecture which avoids the problems of processor switching and memory/processor interconnection that usually limit the

Jack B. Dennis; David P. Misunas

1974-01-01

235

A massively parallel multireference configuration interaction program : the parallel COLUMBUS program.  

SciTech Connect

A massively parallel version of the configuration interaction (CI) section of the COLUMBUS multireference singles and doubles CI (MRCISD) program system is described. In an extension of our previous parallelization work, which was based on message passing, the global array (GA) toolkit has now been used. For each process, these tools permit asynchronous and efficient access to logical blocks of 1- and 2-dimensional (2-D) arrays physically distributed over the memory of all processors. The GAs are available on most of the major parallel computer systems enabling very convenient portability of our parallel program code. To demonstrate the features of the parallel COLUMBUS CI code, benchmark calculations on selected MRCI and SRCI test cases are reported for the CRAY T3D, Intel Paragon, and IBM SP2. Excellent scaling with the number of processors up to 256 processors (CRAY T3D) was observed. The CI section of a 19 million configuration MRCISD calculation was carried out within 20 min wall clock time on 256 processors of a CRAY T3D. Computations with 38 million configurations were performed recently; calculations up to about 100 million configurations seem possible within the near future.

Dachsel, H.; Lischka, H.; Shepard, R.; Nieplocha, J.; Harrison, R. J.; Chemistry; Univ. of Wien; PNNL

1997-01-01

236

Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment  

SciTech Connect

This letter reports on electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates including surgical tissue forceps and sloped plastic plate of up to 15 deg., the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than a comparable single jet when both are used to treat a 15 deg. sloped substrate. These benefits are likely from an effective self-adjustment mechanism among individual jets facilitated by individualized ballast and spatial redistribution of surface charges.

Cao, Z.; Walsh, J. L.; Kong, M. G. [Department of Electronic and Electrical Engineering, Loughborough University, Leices LE11 3TU (United Kingdom)]

2009-01-12

237

Parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers  

SciTech Connect

In this paper we investigate the feasibility of a massively parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers (VCSELs) to measure surface profiles of displacement, distance, velocity, and liquid flow rate. The concept of the system is demonstrated using a prototype to measure the velocity at different radial points on a rotating disk, and the velocity profile of diluted milk in a custom-built diverging-converging planar flow channel. It is envisaged that a scaled-up version of the parallel self-mixing imaging system will enable real-time surface profiling, vibrometry, and flowmetry.

Tucker, John R.; Baque, Johnathon L.; Lim, Yah Leng; Zvyagin, Andrei V.; Rakic, Aleksandar D

2007-09-01

238

Hierarchical gate-array routing on a hypercube multiprocessor  

Microsoft Academic Search

Gate-arrays are the most common design style for semicustom VLSI integrated circuits. An important part of the gate-array design process is the routing of wires between the logic elements, which is an extremely compute-intensive operation. This paper presents an algorithm for routing gate-arrays that uses a hypercube connected parallel processor to provide the necessary computation power. In order to make

O. A. Olukotun; T. N. Mudge

1990-01-01

239

Hierarchical Gate-Array Routing on a Hypercube Multiprocessor  

Microsoft Academic Search

Gate-arrays are the most common design style for semicustom VLSI integrated circuits. An important part of the gate-array design process is the routing of wires between the logic elements, which is an extremely compute-intensive operation. This paper presents an algorithm for routing gate-arrays that uses a hypercube connected parallel processor to provide the necessary computation power. In order to make

Oyekunle A. Olukotun; Trevor N. Mudge

1990-01-01

240

PDDP: A data parallel programming model. Revision 1  

SciTech Connect

PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared-memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

Warren, K.H.

1995-06-01
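
To make the notion of a data distribution directive concrete, the following Python sketch shows how a BLOCK distribution in the style of High Performance Fortran maps global array indices to an owning processor and a local index. This is a sketch under stated assumptions: the abstract does not show the code PDDP generates, and the helper names `block_owner` and `local_range` are hypothetical.

```python
def block_size(n, p):
    """Block length for distributing n elements over p processors: ceil(n / p)."""
    return -(-n // p)


def block_owner(i, n, p):
    """Return (owner, local_index) for 0-based global index i."""
    if not 0 <= i < n:
        raise IndexError(i)
    b = block_size(n, p)
    return i // b, i % b


def local_range(rank, n, p):
    """Global indices stored on processor `rank` (possibly empty)."""
    b = block_size(n, p)
    lo = min(rank * b, n)
    hi = min(lo + b, n)
    return range(lo, hi)


# Example: 10 elements over 4 processors gives blocks of 3, 3, 3, and 1.
layout = [list(local_range(r, 10, 4)) for r in range(4)]
```

With a mapping like this, a reference to a distributed element in a shared-memory-style program can be turned into either a local access or a communication with the owning processor.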

241

Signal processor packaging design  

NASA Astrophysics Data System (ADS)

The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

McCarley, Paul L.; Phipps, Mickie A.

1993-10-01

242

Parallel array of YBa2Cu3O7-δ superconducting Josephson vortex-flow transistors with high current gains  

NASA Astrophysics Data System (ADS)

We have developed a Josephson vortex-flow transistor based on a parallel array of 440 YBa2Cu3O7-δ bicrystal grain boundary Josephson junctions. The array's critical current Ic was measured as a function of the control current Ictrl through a control line that is inductively coupled to the array. The device has a highly asymmetric Ic(Ictrl) curve with several regions where a switching behaviour is observed characterized by a maximum current gain gmax = ∂Ic/∂Ictrl of 19 and a significant dynamic range of 20 µA at 77 K. In the range 4.7-92 K gmax versus temperature is non-monotonic with a maximum recorded at 77 K.

Chesca, Boris; John, Daniel; Kemp, Matthew; Brown, Jeffrey; Mellor, Christopher

2013-08-01

243

Algorithms for parallel polygon rendering  

SciTech Connect

This book is the result of research in the implementation of polygon-based graphics operations on certain general purpose parallel processors; the aim is to provide a speed-up over sequential implementations of the graphics operations concerned, and the resulting software can be viewed as a subset of the application suites of the relevant parallel machines. A literature review and a brief description of the architectures considered give an introduction into the field. Most algorithms are consistently presented in an extension of the Occam language which includes single instruction multiple data stream (SIMD) data types and operations on them. Methods for polygon rendering - including the operations of filling, hidden surface elimination and smooth shading - are presented for SIMD architectures like the DAP and for a dual-paradigm (SIMD-MIMD) machine constructed out of a DAP-like processor array and a transputer network. Polygon clipping algorithms for both transputer and the DAP are described and contrasted.

Theoharis, T. (St. Catherine's College, Cambridge (GB))

1989-01-01

244

The digital signal processor for the ALCOR millimeter wave radar  

NASA Astrophysics Data System (ADS)

This report describes the use of an array processor for real time radar signal processing. Pulse compression, range marking, and monopulse error computation are some of the functions that will be performed in the array processor for the millimeter wave ALCOR radar augmentation. Real time software design, processor architecture, and system interfaces are discussed in the report.

Ford, R. A.

1980-11-01

245

Benchmarks of Low-Level Vision Algorithms for DSP, FPGA, and Mobile PC Processors  

NASA Astrophysics Data System (ADS)

We present recent results of a performance benchmark of selected low-level vision algorithms implemented on different high-speed embedded platforms. The algorithms were implemented on a digital signal processor (DSP) (Texas Instruments TMS320C6414), a field-programmable gate array (FPGA) (Altera Stratix-I and II families) as well as on a mobile PC processor (Intel Mobile Core 2 Duo T7200). These implementations are evaluated, compared, and discussed in detail. The DSP and the mobile PC implementations, both making heavy use of processor-specific acceleration techniques (intrinsics and resource-optimized slicing direct memory access on DSPs, or the Intel Integrated Performance Primitives library on mobile PC processors), outperform the FPGA implementations, but at the cost of spending all their resources on these tasks. FPGAs, however, are very well suited to algorithms that benefit from parallel execution.

Baumgartner, Daniel; Roessler, Peter; Kubinger, Wilfried; Zinner, Christian; Ambrosch, Kristian

246

Parallel solution of triangular systems of equations  

SciTech Connect

Methods are presented for the parallel solution of (n*n) lower triangular linear systems suitable for a p processor MIMD computer system where n/2processors as soon as they become available, thus creating a wavefront through the triangular array. The algorithms are shown to run in time (4n-3p-2) for p<2(n-1)/3 and in time 2(n-1) for p>2(n-1)/3. Comparison with recent methods shows the algorithms to be competitive, especially when applied to block triangular systems. 7 references.

Evans, D.J.; Dunbar, R.C.

1983-02-01
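
For orientation, the sequential baseline that such parallel methods start from is forward substitution. The Python sketch below uses the column-oriented form, in which the updates following each newly computed unknown are mutually independent and could therefore be shared among processors. It is a plain illustration, not the wavefront scheduling of the record above, and the function name `forward_substitution` is an assumption.

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower triangular matrix L given as a list of rows."""
    n = len(b)
    x = [0.0] * n
    r = [float(v) for v in b]  # running right-hand side
    for j in range(n):
        x[j] = r[j] / L[j][j]
        # These n - j - 1 updates do not depend on each other, so on a
        # parallel machine they could be divided among the processors.
        for i in range(j + 1, n):
            r[i] -= L[i][j] * x[j]
    return x


# Small worked example with exact solution x = [1, 2, 3].
L = [[2, 0, 0],
     [1, 3, 0],
     [4, -1, 5]]
b = [2, 7, 17]
x = forward_substitution(L, b)
```

Only the division that produces each x[j] lies on the critical path; everything else in the inner loop is parallel work, which is what makes a wavefront-style schedule possible.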

247

A programmable processing array architecture supporting dynamic task scheduling and module-level prefetching  

Microsoft Academic Search

Massively Parallel Processing Arrays (MPPA) constitute programmable hardware accelerators that excel in the execution of applications exhibiting Data-Level Parallelism (DLP). The concept of employing such programmable accelerators as sidekicks to the more traditional, general-purpose processing cores has very recently entered the mainstream; both Intel and AMD have introduced processor architectures integrating a Graphics Processing Unit (GPU) alongside the main CPU

Junghee Lee; Hyung Gyu Lee; Soonhoi Ha; Jongman Kim; Chrysostomos Nicopoulos

2012-01-01

248

Systolic processor for signal processing  

SciTech Connect

A systolic array is a natural architecture for a high-performance signal processor, in part because of the extensive use of inner-product operations in signal processing. The modularity and simple interconnection of systolic arrays promise to simplify the development of cost-effective, high-performance, special-purpose processors. ESL Incorporated has built a proof-of-concept model of a systolic processor. It is flexible enough to permit experimentation with a variety of algorithms and applications. ESL is exploring the application of systolic processors to image- and signal-processing problems. This paper describes this experimental system and some of its applications to signal processing. ESL is also pursuing new types of systolic architectures, including the VLSI implementation of systolic cells for solving systems of linear equations. These new systolic architectures allow the real-time design of adaptive filters. 14 references.

Frank, G.A.; Greenawalt, E.M.; Kulkarni, A.V.

1982-01-01

249

Maximum likelihood SPECT in clinical computation times using mesh-connected parallel computers.  

PubMed

Extending the work of A.W. McCarthy et al. (1988) and M.I. Miller and B. Roysam (1991), the authors demonstrate that a fully parallel implementation of the maximum-likelihood method for single-photon emission computed tomography (SPECT) can be accomplished in clinical time frames on massively parallel systolic array processors. The authors show that for SPECT imaging on 64×64 image grids, with 96 view angles, the single-instruction, multiple-data (SIMD) distributed array processor containing 64^2 processors performs the expectation-maximization (EM) algorithm with Good's smoothing at a rate of 1 iteration/1.5 s. This promises fully Bayesian reconstructions for emission tomography, including regularization, in clinical computation times on the order of 1 min/slice. The most important result of the implementations is that the scaling rules for computation times are roughly linear in the number of processors. PMID:18222845

McCarthy, A W; Miller, M I

1991-01-01
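
The per-pixel structure that makes the EM algorithm attractive for an array processor can be seen in a minimal Python sketch of one maximum-likelihood EM update for emission tomography. This is the standard unregularized update; Good's smoothing, which the record above includes, is deliberately omitted, and the system matrix `A`, the function name `mlem_step`, and the toy data are assumptions made for illustration.

```python
def mlem_step(A, y, x):
    """One ML-EM update of the image estimate x.

    A[i][j] is the probability that an emission in pixel j is counted in
    detector bin i; y[i] is the measured count in bin i.
    """
    m, n = len(A), len(x)
    # Forward projection of the current estimate.
    proj = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    # Ratio of measured to predicted counts (0 where nothing is predicted).
    ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(m)]
    # Sensitivity of each pixel: total detection probability over all bins.
    sens = [sum(A[i][j] for i in range(m)) for j in range(n)]
    # Back-project the ratios; every pixel update is independent of the others.
    return [
        x[j] * sum(A[i][j] * ratio[i] for i in range(m)) / sens[j]
        if sens[j] > 0 else 0.0
        for j in range(n)
    ]


# Toy problem: 3 detector bins, 2 pixels, uniform starting image.
A = [[0.5, 0.2],
     [0.3, 0.3],
     [0.2, 0.5]]
y = [10.0, 6.0, 4.0]
x1 = mlem_step(A, y, [1.0, 1.0])
```

Because each pixel is updated independently, a 64×64 image (4096 pixels) could be mapped one pixel per processor on a 64^2-processor array; that mapping is an assumption here. At the quoted 1.5 s per iteration, 40 iterations fit in 60 s.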

250

A parallel processing VLSI BAM engine.  

PubMed

In this paper, emerging parallel/distributed architectures are explored for the digital VLSI implementation of an adaptive bidirectional associative memory (BAM) neural network. A single instruction stream, many data stream (SIMD)-based parallel processing architecture is developed for the adaptive BAM neural network, taking advantage of the inherent parallelism in BAM. This novel neural processor architecture is named the sliding feeder BAM array processor (SLiFBAM). The SLiFBAM processor can be viewed as a two-stroke neural processing engine. It has four operating modes: learn pattern, evaluate pattern, read weight, and write weight. Design of a SLiFBAM VLSI processor chip is also described. By using 2-µm scalable CMOS technology, a SLiFBAM processor chip with 4+4 neurons and eight modules of 256×5-bit local weight-storage SRAM was integrated on a 6.9×7.4 mm² prototype die. The system architecture is highly flexible and modular, enabling the construction of larger BAM networks of up to 252 neurons using multiple SLiFBAM chips. PMID:18255644

Hasan, S R; Siong, N K

1997-01-01
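
The "learn pattern" and "evaluate pattern" modes can be illustrated with a minimal Python sketch of a textbook bidirectional associative memory with 4+4 bipolar neurons. It assumes the classic outer-product (Hebbian) learning rule and threshold recall; the adaptive learning rule and the sliding-feeder hardware organization of SLiFBAM are not described in the abstract and are not modeled. The names `learn` and `evaluate` are hypothetical.

```python
def learn(pairs):
    """Build the weight matrix as a sum of outer products of bipolar pairs."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * m for _ in range(n)]
    for x, y in pairs:
        for i in range(n):
            for j in range(m):
                W[i][j] += x[i] * y[j]
    return W


def _threshold(s, prev):
    """Bipolar threshold; a zero input leaves the neuron state unchanged."""
    return 1 if s > 0 else -1 if s < 0 else prev


def evaluate(W, x, max_iters=10):
    """Alternate forward and backward passes until the pair stops changing."""
    n, m = len(W), len(W[0])
    x, y = list(x), [1] * m
    for _ in range(max_iters):
        # Forward pass (first stroke): x layer drives the y layer.
        new_y = [_threshold(sum(x[i] * W[i][j] for i in range(n)), y[j])
                 for j in range(m)]
        # Backward pass (second stroke): y layer drives the x layer.
        new_x = [_threshold(sum(W[i][j] * new_y[j] for j in range(m)), x[i])
                 for i in range(n)]
        if new_x == x and new_y == y:
            break
        x, y = new_x, new_y
    return x, y


# Two mutually orthogonal pattern pairs for a 4+4 neuron memory.
pairs = [([1, -1, 1, -1], [1, 1, -1, -1]),
         ([1, 1, -1, -1], [1, -1, 1, -1])]
W = learn(pairs)
recalled = evaluate(W, pairs[0][0])
```

Every weight update in `learn` and every neuron sum in `evaluate` is independent of the others in the same pass, which is the kind of inherent parallelism a SIMD array can exploit.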

251

MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications  

Microsoft Academic Search

This paper introduces MorphoSys, a reconfigurable computing system developed to investigate the effectiveness of combining reconfigurable hardware with general-purpose processors for word-level, computation-intensive applications. MorphoSys is a coarse-grain, integrated, and reconfigurable system-on-chip, targeted at high-throughput and data-parallel applications. It is comprised of a reconfigurable array of processing cells, a modified RISC processor core, and an efficient memory interface unit. This

Hartej Singh; Ming-hau Lee; Guangming Lu; Fadi J. Kurdahi; Nader Bagherzadeh; Eliseu M. Chaves Filho

2000-01-01

252

Implementing the generalized matrix product on a systolic array parallel architecture  

NASA Astrophysics Data System (ADS)

The generalized matrix product includes in its formulation many common array manipulations. It also provides a framework for the expression of a number of important image processing algorithms. It is shown that the generalized matrix product may be implemented in its full generality on systolic array architectures. Two approaches are presented. One approach is to regard the generalized matrix product as a collection of products of small matrices and then consider arrangements of systolic configurations common to the smaller products. A second approach is to embed the two factors of the generalized matrix product in sparse matrices and multiply the sparse matrices using a conventional systolic array.

Stright, James R.

1997-09-01

253

The collimation of intense atomic beams by parallel tube-arrays  

NASA Astrophysics Data System (ADS)

We report on a series of experiments to characterize the collimation of atomic vapor beams. Hot ^6Li atoms were emitted from an atomic beam source and collimated by various tube-array sections. Flux intensity profiles were obtained with several arrays for tubes having a range of shape factors, ? = d/L. Measurements were made in the molecular and hydrodynamic flow regimes (i.e. for different Knudsen parameters).

Huckans, John

2010-03-01

254

Parallel Recording Array Head of Nano-Aperture Flat-Tip Probes for High-Density Near-Field Optical Data Storage  

Microsoft Academic Search

Increasing the memory capacity and data transfer rate of optical data storage to match the market requirements is now a challenging task. To realize a more effective and simple memory system, a parallel near-field optical system has been proposed using vertical cavity surface emitting laser (VCSEL) microprobe array heads. The concept, structure and fabrication process of new flat-tip microprobe arrays

Young-Joo Kim; Kazuhiro Suzuki

2001-01-01

255

A 256×256 CMOS imaging array with wide dynamic range pixels and column-parallel digital output  

Microsoft Academic Search

A stepped reset-gate voltage technique is applied to a CMOS active pixel sensor array to increase dynamic range by 26 dB. A frame rate of 390 frames/s is achieved using column-parallel output circuits. Switched-capacitor correlated double-sampling circuits reduce fixed-pattern noise to 4.0 mV (dark). Cyclic analog-to-digital converters achieve approximately 9-b accuracy. At 30 frames/s, random noise is 0.56 mV (dark),

Steven Decker; D. McGrath; Kevin Brehmer; Charles G. Sodini

1998-01-01

256

3D optical interconnect mesh network for on-board parallel multiprocessor system based on EOPCB  

NASA Astrophysics Data System (ADS)

A three-dimensional (3-D) 4×4×4 optical interconnect mesh network scheme for a parallel multiprocessor system based on a polymer light waveguide electro-optical printed circuit board (EOPCB) is proposed in this paper. Mesh topological structures of light waveguide interconnects are constructed for processor element chip-to-chip links on a board and for board-to-board links on a backplane. The system consists of 64 processor element chips interconnected in a 3-D mesh network configuration. Every processor board comprises 4×4 processor element chips with mesh interconnection. Board-to-board mesh interconnects are established on a backplane through a light waveguide mesh interconnect topological structure. An additional optical layer with a light waveguide structure is added to a conventional PCB to construct the EOPCB. A vertical cavity surface emitting laser (VCSEL) array is used as the optical transmitter array, and a PIN photodiode array is used as the optical receiver array. An MT-compatible direct coupling method is presented to couple the light beam between the optical transmitter/receiver and the light waveguide layer. The optical signals from a processor element chip on one board can be transmitted to a processor element chip on another board through the light waveguide interconnection in the backplane. A 3-D optical interconnection mesh network for a parallel multiprocessor system can thus be realized with EOPCB.

Luo, Fengguang; Cao, Mingcui; Zhou, Xinjun; Xu, Jun; Luo, Zhixiang; Yuan, Jing; Zong, Liangjia; Feng, Yonghua; Chen, Chao; Zhang, Conghui

2007-11-01
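
The topology described above can be checked with a short Python sketch that enumerates the links of a 4×4×4 mesh and separates on-board links from backplane links. It assumes a mesh without wrap-around connections and takes the third coordinate as the board index; both are assumptions, as are the names `mesh_links`, `on_board`, and `backplane`.

```python
from itertools import product


def mesh_links(dims=(4, 4, 4)):
    """Return the set of links between adjacent nodes of a mesh (no wrap-around)."""
    links = set()
    for node in product(*(range(d) for d in dims)):
        for axis, size in enumerate(dims):
            if node[axis] + 1 < size:
                neighbor = list(node)
                neighbor[axis] += 1
                links.add((node, tuple(neighbor)))
    return links


links = mesh_links()
# Assumption: nodes sharing the third coordinate sit on the same board.
on_board = [link for link in links if link[0][2] == link[1][2]]
backplane = [link for link in links if link[0][2] != link[1][2]]
```

Under these assumptions the 64 processor elements are joined by 144 links in total: 24 waveguide links on each of the four 4×4 boards (96 in all) and 48 board-to-board links through the backplane.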

257

Application of Array Processing for Parallel Linear Recursive Kalman Filtering in Underwater Acoustic Exploration  

Microsoft Academic Search

In underwater seismic exploration applications using acoustic arrays, the firing pulses suffer from air bubbles due to sudden change in water pressure. In practice, the optimum wave form is obtained after the firing pulse has travelled for about 50 meters in the water column. The wave-form is non-periodic and may be approximated by a damped sinusoidal wave. The wave hits

F. El-Hawary; K. Ravindranath

1986-01-01

258

A 2 V 250 MHz multimedia processor  

Microsoft Academic Search

This paper introduces a VLIW dual-issue RISC processor enhanced with sub-word and DSP instructions for multimedia applications. The processor core integrates 300 k transistors in an 8 mm² area and is implemented with 64 kB RAM onto a 6.0×6.2 mm² chip in a 2.0 V, 0.3 µm CMOS process. The processor exploits two modes of parallelism, dual issue instruction execution

T. Yoshida; Y. Shimazu; A. Yamada; E. Holmann; K. Nakakimura; H. Takata; M. Kitao; T. Kishi; H. Kobayashi; M. Sato; A. Mohri; K. Suzuki; Y. Ajioka; K. Higashitani

1997-01-01

259

Parallel p-code for parallel Pascal and other high level languages  

SciTech Connect

Parallel p-code is an intermediate compiler language for parallel processors. It was originally designed as part of a parallel Pascal compiler for NASA's massively parallel processor (MPP). However, it should also be suitable for a wide variety of high level languages and parallel architectures. Parallel p-code is based on a p-code language for serial processors. The authors describe the extensions which were necessary for the parallel environment. 6 references.

Bruner, J.D.; Reeves, A.P.

1983-01-01

260

Parallel P-code for Parallel Pascal and other high level languages  

SciTech Connect

Parallel P-code is an intermediate compiler language for parallel processors. It was originally designed as part of a Parallel Pascal compiler for NASA's Massively Parallel Processor (MPP). However, it should also be suitable for a wide variety of high level languages and parallel architectures. Parallel P-code is based on a P-code language for serial processors; this paper describes the extensions which were necessary for the parallel environment.

Bruner, J.D.; Reeves, A.P.

1983-07-21

261

Design and fabrication of arrays of nanoelectromechanical resonators for parallel detection of biomolecular interactions  

Microsoft Academic Search

The recent achievements of surface and bulk micro and nanomachining techniques combined with the fabrication techniques of integrated circuits have led to the development of miniaturized sensors and actuators exhibiting unprecedented sensitivity. This work is dedicated to the design and fabrication of nanoelectromechanical resonators for parallel detection of biomolecular interactions. The aim is to obtain piezoresistive nanoresonators with a high

C. Bergaud; E. Cocheteau; M. Guirardel; L. Nicu; B. Belier

2001-01-01

262

Parallel assessment of CpG methylation by two-color hybridization with oligonucleotide arrays  

Microsoft Academic Search

We have developed a method for the parallel analysis of multiple CpG sites in genomic DNA for their state of methylation. Hypermethylation of CpG islands within the promoters and 5′ exons of genes has been found to be a mechanism of transcriptional inactivation associated with a variety of tumors. The method that we developed relies on the differential reactivity of

Robert P Balog; Y Emi Ponce de Souza; Hue M Tang; Gina M DeMasellis; Boning Gao; Adrian Avila; Desmond J Gaban; David Mittelman; John D Minna; Kevin J Luebke; Harold R Garner

2002-01-01

263

Perfect spin filtering and conditions for Fano antiresonance and Dicke resonance in a parallel coupled triple quantum-dot array  

NASA Astrophysics Data System (ADS)

Electronic transport through a parallel coupled triple quantum dot (tQD) array has been studied by means of the nonequilibrium Green's function formalism. By introducing a difference between the site energies of the upper and lower QDs, we find that the linear conductance spectrum of this tQD array displays Fano antiresonance and Dicke resonance effects. As the energy difference increases or the tQD chain length increases to a not very large value, the antiresonance valley in the conductance changes to a well-defined insulating band with very steep edges. Meanwhile, the relations between the Fano antiresonance and the well-defined insulating band are explored, and the conditions for the Fano antiresonance and the Dicke resonance are presented. By introducing a Zeeman splitting due to an external magnetic field, the spin-splitting conductance spectrum shows some windows with high to 100% spin polarization (spin-polarized windows, SPWs). If a gate voltage runs in these SPWs, we can achieve an entirely spin-polarized current, indicating that such a tQD array can be used as a perfect spin filter and a quantum-signal generator. Moreover, the effect of the intradot Coulomb repulsion on the electronic transport is also investigated. The results show that the intradot Coulomb repulsion does not affect the device applications mentioned above for this system.

Fu, Hua-Hua; Yao, Kai-Lun

2013-05-01

264

Fully Integrated Linear Single Photon Avalanche Diode (SPAD) Array with Parallel Readout Circuit in a Standard 180 nm CMOS Process  

NASA Astrophysics Data System (ADS)

This paper reports on the development of a SPAD device, fabricated in a UMC 0.18 μm CMOS process, and its subsequent use in an actively quenched single photon counting imaging system. A low-doped p- guard ring (t-well layer) encircles the active area to prevent premature reverse breakdown. The array is a 16×1 parallel output SPAD array, which comprises an actively quenched SPAD circuit in each pixel with the current value being set by an external resistor RRef = 300 kΩ. The SPAD I-V response, ID, was found to slowly increase until VBD was reached at excess bias voltage, Ve = 11.03 V, and then rapidly increase due to avalanche multiplication. Digital circuitry to control the SPAD array and perform the necessary data processing was designed in VHDL and implemented on an FPGA chip. At room temperature, the dark count was found to be approximately 13 kHz for most of the 16 SPAD pixels and the dead time was estimated to be 40 ns.

Isaak, S.; Bull, S.; Pitter, M. C.; Harrison, Ian.

2011-05-01

265

A three-dimensional architecture for a parallel processing photosensing array.  

PubMed

A three-dimensional architecture for a photosensing array has been developed. This silicon based architecture consists of a 10 x 10 array of photosensors with 80 microns diameter, through chip interconnects to the back side of a 500 microns thick silicon wafer. Each photosensor consists of a 300 x 300 microns pn-junction photodiode. The following processes were used to create this photosensing architecture: 1) thermomigration of aluminum pads through an n-type silicon wafer; 2) creation of pn-junction photosensors on one side of the wafer; and 3) creation of aluminum pad ohmic contacts to the thermomigrated, through chip interconnects and the substrate on the back side of the wafer. The electrical and optical characteristics of the three-dimensional architecture indicate that it should be well suited as a photosensing framework around which a "silicon retina" could be built. PMID:1487292

Johansson, T; Abbasi, M; Huber, R J; Normann, R A

1992-12-01

266

Advanced Ultra High Speed Processor Technologies.  

National Technical Information Service (NTIS)

In Phase 1 of this project, POC designed, fabricated, and tested optical interconnects, and designed a graphics processor based on field programmable gate arrays (FPGAs). POC's board to board connection approach is based on multichannel integrated optical...

A. Kostrzewski

1996-01-01

267

Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays  

Microsoft Academic Search

We describe a novel sequencing approach that combines non-gel-based signature sequencing with in vitro cloning of millions of templates on separate 5 μm diameter microbeads. After constructing a microbead library of DNA templates by in vitro cloning, we assembled a planar array of a million template-containing microbeads in a flow cell at a density greater than 3 × 10⁶ microbeads/cm².

Maria Johnson; John Bridgham; George Golda; David H. Lloyd; Davida Johnson; Shujun Luo; Sarah McCurdy; Michael Foy; Mark Ewan; Rithy Roth; Dave George; Sam Eletr; Glenn Albrecht; Eric Vermaas; Steven R. Williams; Keith Moon; Timothy Burcham; Michael Pallas; Robert B. DuBridge; James Kirchner; Karen Fearon; Jen-i Mao; Kevin Corcoran; Sydney Brenner

2000-01-01

268

Field programmable gate array based parallel strapdown algorithm design for strapdown inertial navigation systems.  

PubMed

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased, while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that, relative to the existing algorithm implemented on the DSP platform, this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time to meet the real-time and high-precision requirements of the system in a high dynamic environment. PMID:22164058

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-08-15

269

Volume holographic wavelet correlation processor  

Microsoft Academic Search

A volume holographic wavelet correlation processor is proposed and constructed for correlation identification. It is based on the theory of wavelet transforms and the mechanism of angle-multiplexing volume holographic associative storage in a photorefractive crystal. High parallelism and discrimination are achieved with the system. Our research shows that cross-talk noise is significantly reduced with wavelet filtering preprocessing. Correlation outputs can

Wenyi Feng; Yingbai Yan; Guofan Jin; Minxian Wu; Qingsheng He

2000-01-01

270

Advanced parallel processing with supercomputer architectures  

SciTech Connect

This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers.

Hwang, K.

1987-10-01

271

Precisely-controlled fabrication of single ZnO nanoemitter arrays and their possible application in low energy parallel electron beam exposure  

NASA Astrophysics Data System (ADS)

Precisely-controlled fabrication of single ZnO nanoemitter arrays and their possible application in low energy parallel electron beam exposure are reported. A well defined polymethyl methacrylate (PMMA) nanohole template was employed for local solution-phase growth of single ZnO nanoemitter arrays. Chlorine plasma etching for surface smoothing and pulsed-laser illumination in nitrogen for nitrogen doping were performed, which can significantly enhance the electron emission and improve the emitter-to-emitter uniformity in performance. Mechanisms responsible for the field emission enhancing effect are proposed. Low voltage (368 V) e-beam exposure was performed by using a ZnO nanoemitter array and a periodical hole pattern (0.72-1.26 μm in diameter) was produced on a thin (25 nm) PMMA. The work demonstrates the feasibility of utilizing single ZnO nano-field emitter arrays for low voltage parallel electron beam lithography.

He, H.; She, J. C.; Huang, Y. F.; Deng, S. Z.; Xu, N. S.

2012-03-01

272

Precisely-controlled fabrication of single ZnO nanoemitter arrays and their possible application in low energy parallel electron beam exposure.  

PubMed

Precisely-controlled fabrication of single ZnO nanoemitter arrays and their possible application in low energy parallel electron beam exposure are reported. A well defined polymethyl methacrylate (PMMA) nanohole template was employed for local solution-phase growth of single ZnO nanoemitter arrays. Chlorine plasma etching for surface smoothing and pulsed-laser illumination in nitrogen for nitrogen doping were performed, which can significantly enhance the electron emission and improve the emitter-to-emitter uniformity in performance. Mechanisms responsible for the field emission enhancing effect are proposed. Low voltage (368 V) e-beam exposure was performed by using a ZnO nanoemitter array and a periodical hole pattern (0.72-1.26 μm in diameter) was produced on a thin (25 nm) PMMA. The work demonstrates the feasibility of utilizing single ZnO nano-field emitter arrays for low voltage parallel electron beam lithography. PMID:22333999

He, H; She, J C; Huang, Y F; Deng, S Z; Xu, N S

2012-02-14

273

Parallelization on a multi-processor system of a solving method for the unsteady Navier-Stokes equations at high Reynolds numbers  

NASA Astrophysics Data System (ADS)

A method for the simulation of viscous incompressible flows around an airfoil is presented which provides resolution of the two dimensional Navier-Stokes equations in stream function and vorticity function formulation. High precision finite difference schemes and Alternating Directions Implicit (ADI) techniques are combined. The algorithm was implemented on a multiprocessor to cope with the parallelism required by ADI methods. The parallelization of the algorithm and performance levels of the parallel code are set out. Pulsed-start numerical simulations on a NACA 0012 airfoil were performed for various airfoil incidence angles and for Reynolds numbers up to 10⁵. The phenomena obtained and the influence of the calculation parameters are discussed, and comparison between numerical and experimentally visualized results validates the method.

Mane, Laure

1990-08-01

274

Photorefractive processing for large adaptive phased arrays.  

PubMed

An adaptive null-steering phased-array optical processor that utilizes a photorefractive crystal to time integrate the adaptive weights and null out correlated jammers is described. This is a beam-steering processor in which the temporal waveform of the desired signal is known but the look direction is not. The processor computes the angle(s) of arrival of the desired signal and steers the array to look in that direction while rotating the nulls of the antenna pattern toward any narrow-band jammers that may be present. We have experimentally demonstrated a simplified version of this adaptive phased-array-radar processor that nulls out the narrow-band jammers by using feedback-correlation detection. In this processor it is assumed that we know a priori only that the signal is broadband and the jammers are narrow band. These are examples of a class of optical processors that use the angular selectivity of volume holograms to form the nulls and look directions in an adaptive phased-array-radar pattern and thereby to harness the computational abilities of three-dimensional parallelism in the volume of photorefractive crystals. The development of this processing in volume holographic system has led to a new algorithm for phased-array-radar processing that uses fewer tapped-delay lines than does the classic time-domain beam former. The optical implementation of the new algorithm has the further advantage of utilization of a single photorefractive crystal to implement as many as a million adaptive weights, allowing the radar system to scale to large size with no increase in processing hardware. PMID:21085246

Weverka, R T; Wagner, K; Sarto, A

1996-03-10

275

Customization of application specific heterogeneous multi-pipeline processors  

Microsoft Academic Search

In this paper we propose application specific instruction set processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application, and

Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

2006-01-01

276

Field programmable gate arrays and reconfigurable computing  

Microsoft Academic Search

In-system programmable, SRAM-based field programmable gate arrays (FPGAs) can be used to create processors and coprocessors whose internal architecture as well as interconnections can be reconfigured to match the needs of a given application. Exploiting the inherent speed and parallelism of a hardware solution, FPGA-based coprocessors can execute computationally-intensive tasks while maintaining the flexibility of a programmable solution. The subject

Bradly K. Fawcett

1995-01-01

277

High-speed (2.5 Gbps) reconfigurable inter-chip optical interconnects using opto-VLSI processors  

NASA Astrophysics Data System (ADS)

Reconfigurable optical interconnects enable flexible and high-performance communication in multi-chip architectures to be arbitrarily adapted, leading to efficient parallel signal processing. The use of Opto-VLSI processors as beam steerers and multicasters for reconfigurable inter-chip optical interconnection is discussed. We demonstrate, as proof-of-concept, 2.5 Gbps reconfigurable optical interconnects between an 850 nm vertical cavity surface emitting laser (VCSEL) array and a photodiode (PD) array integrated onto a PCB by driving two Opto-VLSI processors with steering and multicasting digital phase holograms. The architecture is experimentally demonstrated through three scenarios showing its flexibility to perform single, multicasting, and parallel reconfigurable optical interconnects. To our knowledge, this is the first reported high-speed reconfigurable N-to-N optical interconnects architecture, which will have a significant impact on the flexibility and efficiency of large shared-memory multiprocessor machines.

Aljada, Muhsen; Alameh, Kamal E.; Lee, Yong-Tak; Chung, Il-Sug

2006-07-01

278

Advanced Vertical Array Beamformer.  

National Technical Information Service (NTIS)

The advanced vertical array beamformer signal processor accomplishes acoustic beamforming of an underwater vertical array used in shallow water utilizing matched beam processing to suppress generated noise and/or ship radiated noise thereby increasing the...

T. C. Yang; J. A. Mobbs

1998-01-01

279

Massively parallel data optimization  

Microsoft Academic Search

Techniques for the automatic layout of arrays in a Fortran compiler supporting Fortran 8x array features and targeted to the Connection Machine computer system are discussed. The goal is primarily to minimize the costs of moving data between processors and secondarily to minimize memory usage. Improved array layout may allow communications operations to be eliminated or to be replaced by

Kathleen Knobe; J. D. Lukas; Guy L. Steele Jr

1988-01-01

280

High performance synchronized dual elliptic curve crypto-processor  

Microsoft Academic Search

In this paper a dual crypto-processor for elliptic curve cryptography has been proposed. The proposed architecture can perform two independent scalar multiplications in parallel over GF(2^m). Although in this crypto-processor two independent scalar multiplications are performed in parallel, no extra arithmetic unit is employed in this crypto-processor (except an addition unit). Thus the architecture includes a field multiplier, a field

Abdulah A. Zadeh

2009-01-01
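
The field multiplier mentioned in the record above operates on elements of GF(2^m). As an illustration only, the following is a minimal software sketch of shift-and-add multiplication in GF(2^m); the toy field GF(2^4) with reduction polynomial x^4 + x + 1 and the function name `gf2m_mul` are assumptions chosen for readability and say nothing about the paper's hardware design or field size.

```python
def gf2m_mul(a, b, m, poly):
    """Multiply a and b in GF(2^m).

    Elements are integers whose bits are polynomial coefficients.
    poly is the irreducible reduction polynomial including its x^m term.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= poly            # reduce modulo the field polynomial
    return result

if __name__ == "__main__":
    M, POLY = 4, 0b10011         # assumed toy field: x^4 + x + 1
    print(bin(gf2m_mul(0b0010, 0b1000, M, POLY)))  # x * x^3 reduced mod POLY
```

Elliptic curve scalar multiplication is built from many such field multiplications together with additions (XOR) and squarings, which is why a single shared multiplier tends to dominate the area of this kind of design.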

281

Data-parallel, volume-rendering algorithms  

Microsoft Academic Search

In this presentation we consider the image composition scheme for parallel volume rendering in which each processor is assigned a portion of the volume. A processor renders only its data by using any existing volume rendering algorithm. We describe one such parallel algorithm that also takes advantage of vector processing capabilities. The resulting images from all processors are then combined

Roni Yagel; Raghu Machiraju

1995-01-01
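
The image composition scheme in the record above has each processor render its own portion of the volume and then merges the partial images. The abstract is cut off before it says how the images are combined, so the sketch below assumes the common front-to-back "over" operator on premultiplied (color, alpha) pixels; the function names and the single-channel pixel format are hypothetical.

```python
def over(front, back):
    """Composite one premultiplied (color, alpha) pixel over another."""
    cf, af = front
    cb, ab = back
    return (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)

def composite(partials):
    """Combine per-processor images that are ordered front to back.

    Each image is a list of (color, alpha) pixels of equal length.
    """
    result = list(partials[0])
    for image in partials[1:]:
        result = [over(p, q) for p, q in zip(result, image)]
    return result

if __name__ == "__main__":
    # Three processors, one pixel each: semi-transparent, empty, opaque.
    slabs = [[(0.25, 0.25)], [(0.0, 0.0)], [(1.0, 1.0)]]
    print(composite(slabs))
```

Because the "over" operator is associative, the partial images could also be merged pairwise in a tree across processors instead of sequentially, as long as the depth order is preserved.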

282

MRI of the Wrist at 7 Tesla using an 8 Channel Array Coil Combined with Parallel Imaging: Preliminary Results  

PubMed Central

PURPOSE To determine the feasibility of performing MRI of the wrist at 7 Tesla with parallel imaging and to evaluate how acceleration factors (AF) affect signal-to-noise ratio (SNR), contrast-to-noise ratio (CNR), and image quality. MATERIALS AND METHODS This study had institutional review board approval. A 4-transmit 8-receive channel array coil was constructed in-house. Nine healthy subjects were scanned on a 7T whole-body MR scanner. Coronal and axial images of cartilage and trabecular bone micro-architecture (3D-Fast Low Angle Shot (FLASH) with and without fat suppression, TR/TE=20ms/4.5ms, flip angle=10°, 0.169–0.195×0.169–0.195 mm, 0.5–1 mm slice thickness) were obtained with AF 1, 2, 3, 4. T1-weighted fast spin-echo (FSE), proton density-weighted FSE, and multiple-echo data image combination (MEDIC) sequences were also performed. SNR and CNR were measured. Three musculoskeletal radiologists rated image quality. Linear correlation analysis and paired t-tests were performed. RESULTS At higher AF, SNR and CNR decreased linearly for cartilage, muscle, and trabecular bone (r […] parallel imaging. SNR and CNR decrease with higher AF, but image quality remains above-average.

Chang, Gregory; Friedrich, Klaus M.; Wang, Ligong; Vieira, Renata L.R.; Schweitzer, Mark E.; Recht, Michael P.; Wiggins, Graham C.; Regatte, Ravinder R.

2010-01-01

283

Parallel Recording Array Head of Nano-Aperture Flat-Tip Probes for High-Density Near-Field Optical Data Storage  

NASA Astrophysics Data System (ADS)

Increasing the memory capacity and data transfer rate of optical data storage to match the market requirements is now a challenging task. To realize a more effective and simple memory system, a parallel near-field optical system has been proposed using vertical cavity surface emitting laser (VCSEL) microprobe array heads. The concept, structure and fabrication process of new flat-tip microprobe arrays have been discussed and realized by the preparation of a silicon probe array in this research. Flat-tip probes are advantageous for improving optical properties since they are prepared from materials of high refractive index and the array shows good structural design for the contact head system with good uniformity of the probe height. We have successfully prepared Si nano-aperture probe arrays with the aperture size of 150 to 500 nm using microfabrication techniques, including photolithography, wet chemical etching, and a newly developed aperture formation process which uses a SiO2 mask layer. The microstructural observation of Si flat-tip probe arrays is in good agreement with our design concepts and supports the strong possibility of their application to actual recording heads. We are now developing a monolithic nano-aperture VCSEL probe to complete our parallel near-field optical system with high memory capacity and fast transfer rates in the near future.

Kim, Young-Joo; Suzuki, Kazuhiro; Goto, Kenya

2001-03-01

284

Pthreads for Dynamic Parallelism  

Microsoft Academic Search

Expressing a large number of lightweight, parallel threads in a shared address space significantly eases the task of writing a parallel program. Threads can be dynamically created to execute individual parallel tasks; the implementation schedules these threads onto the processors and effectively balances the load. However, unless the threads scheduler is designed carefully, such a parallel program may suffer

Girija J. Narlikar; Guy E. Blelloch

1998-01-01

285

Optimal weight extraction for adaptive beamforming using systolic arrays  

NASA Astrophysics Data System (ADS)

Systolic algorithms and architectures for parallel and fully pipelined instantaneous optimal weight extraction for multiple sidelobe canceller (MSC) and minimum variance distortionless response (MVDR) beamformer are presented. The proposed systolic parallelogram array processors are parallel and fully pipelined, and they can extract the optimal weights instantaneously without the need for forward or backward substitution. We also show that the square-root-free Givens method can be easily incorporated to improve the throughput rate and speed up the system. As a result, these MSC and MVDR systolic array weight extraction systems are suitable for real-time very large scale integration (VLSI) implementation in practical radar/sonar systems.

Tang, C. E. T.; Liu, K. J.; Tretter, S. A.

1994-04-01
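
For reference, the MVDR weights named in the record above have the standard closed form w = R⁻¹s / (sᴴR⁻¹s), where R is the array covariance matrix and s the steering vector. The sketch below evaluates that formula with an ordinary Gaussian-elimination solve; this is explicitly not the paper's method, whose systolic arrays are described as avoiding forward and backward substitution. The helper names and the example values are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        acc = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / M[i][i]
    return x

def mvdr_weights(R, s):
    """Return w = R^{-1} s / (s^H R^{-1} s) for Hermitian covariance R."""
    u = solve(R, s)
    denom = sum(complex(si).conjugate() * ui for si, ui in zip(s, u))
    return [ui / denom for ui in u]

def response(w, s):
    """Array response w^H s toward steering vector s."""
    return sum(complex(wi).conjugate() * si for wi, si in zip(w, s))

if __name__ == "__main__":
    R = [[2.0, 0.5], [0.5, 1.0]]   # assumed toy covariance matrix
    s = [1.0, 1j]                  # assumed steering vector
    w = mvdr_weights(R, s)
    print(abs(response(w, s)))     # distortionless constraint: close to 1
```

The distortionless constraint wᴴs = 1 holds by construction, which gives a simple check on any implementation, whether a direct solve as here or a pipelined hardware array.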

286

Infrared laser transillumination CT imaging system using parallel fiber arrays and optical switches for finger joint imaging  

NASA Astrophysics Data System (ADS)

The heterodyne detection technique, on which the coherent detection imaging (CDI) method is founded, can discriminate and select very weak, highly directional forward scattered, and coherence retaining photons that emerge from scattering media in spite of their complex and highly scattering nature. That property enables us to reconstruct tomographic images using the same reconstruction technique as that of X-Ray CT, i.e., the filtered backprojection method. Our group had so far developed a transillumination laser CT imaging method based on the CDI method in the visible and near-infrared regions and reconstruction from projections, and reported a variety of tomographic images both in vitro and in vivo of biological objects to demonstrate its effectiveness for biomedical use. Since the previous system was not optimized, it took several hours to obtain a single image. For practical use, we developed a prototype CDI-based imaging system using a parallel fiber array and optical switches to reduce the measurement time significantly. Here, we describe a prototype fiber-optic transillumination laser CT imaging system based on optical heterodyne detection for early diagnosis of rheumatoid arthritis (RA), by demonstrating the tomographic imaging of an acrylic phantom as well as the fundamental imaging properties. We expect that further refinements of the fiber-optic-based laser CT imaging system could lead to a novel and practical diagnostic tool for rheumatoid arthritis and other joint- and bone-related diseases in the human finger.

Sasaki, Yoshiaki; Emori, Ryota; Inage, Hiroki; Goto, Masaki; Takahashi, Ryo; Yuasa, Tetsuya; Taniguchi, Hiroshi; Devaraj, Balasigamani; Akatsuka, Takao

2004-05-01

287

SCAN secure processor and its biometric capabilities  

NASA Astrophysics Data System (ADS)

This paper presents the design of the SCAN secure processor and its extended instruction set to enable secure biometric authentication. The SCAN secure processor is a modified SparcV8 processor architecture with a new instruction set to handle voice, iris, and fingerprint-based biometric authentication. The algorithms for processing biometric data are based on the local global graph methodology. The biometric modules are synthesized in reconfigurable logic and the results of the field-programmable gate array (FPGA) synthesis are presented. We propose to implement the above-mentioned modules in an off-chip FPGA co-processor. Further, the SCAN-secure processor will offer a SCAN-based encryption and decryption of 32 bit instructions and data.

Kannavara, Raghudeep; Mertoguno, Sukarno; Bourbakis, Nikolaos

2011-04-01

288

Online track processor for the CDF upgrade  

SciTech Connect

A trigger track processor, called the eXtremely Fast Tracker (XFT), has been designed for the CDF upgrade. This processor identifies high transverse momentum (> 1.5 GeV/c) charged particles in the new central outer tracking chamber for CDF II. The XFT design is highly parallel to handle the input rate of 183 Gbits/s and output rate of 44 Gbits/s. The processor is pipelined and reports the result for a new event every 132 ns. The processor uses three stages: hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. The complete system has been installed and commissioned at CDF II. An overview of the track processor and performance in CDF Run II are presented.

E. J. Thomson et al.

2002-07-17

289

Reconfigurable VLSI architecture for a database processor  

SciTech Connect

This work brings together the processing potential offered by regularly structured VLSI processing units and the architecture of a database processor, the relational associative processor (RAP). The main motivations are to integrate a RAP cell processor on a few VLSI chips and improve performance by employing procedures exploiting these VLSI chips and the system level reconfigurability of processing resources. The resulting VLSI database processor consists of parallel processing cells that can be reconfigured into a large processor to execute the hard operations of projection and semijoin efficiently. It is shown that such a configuration can provide 2 to 3 orders of magnitude of performance improvement over previous implementations of the RAP system in the execution of such operations. 27 refs.

Oflazer, K.

1983-01-01

290

Development of a parallel detection and processing system using a multidetector array for wave field restoration in scanning transmission electron microscopy  

SciTech Connect

A parallel image detection and image processing system for scanning transmission electron microscopy was developed using a multidetector array consisting of a multianode photomultiplier tube arranged in an 8x8 square array. The system enables the taking of 64 images simultaneously from different scattered directions with a scanning time of 2.6 s. Using the 64 images, phase and amplitude contrast images of gold particles on an amorphous carbon thin film could be separately reconstructed by applying respective 8 shaped bandpass Fourier filters for each image and multiplying the phase and amplitude reconstructing factors.

Taya, Masaki; Matsutani, Takaomi; Ikuta, Takashi; Saito, Hidekazu; Ogai, Keiko; Harada, Yoshihito; Tanaka, Takeo; Takai, Yoshizo [Department of Material and Life Science, Graduate School of Engineering, Osaka University, 2-1 Yamada-oka, Suita, Osaka 565-0871 (Japan); Department of Electronics and Lightwave Sciences, Faculty of Information and Communication Engineering, Osaka Electro-Communication University, Neyagawa, Osaka 572-8530 (Japan); APCO Ltd., 522-10 Kitano, Hachioji, Tokyo 192-0906 (Japan); Department of Mechanical Engineering, Faculty of Engineering, Osaka Sangyo University, Daito, Osaka 574-8530 (Japan); Department of Material and Life Science, Graduate School of Engineering, Osaka University, 2-1 Yamada-oka, Suita, Osaka 565-0871 (Japan)

2007-08-15

291

A quantitative ultrastructural study of peripheral blood lymphocytes containing parallel tubular arrays in Epstein-Barr virus and cytomegalovirus mononucleosis.  

PubMed Central

In the normal peripheral circulation there exists a subpopulation of lymphocytes that is ultrastructurally distinct. This lymphocyte is identified with the electron microscope by the presence of cytoplasmic microtubulelike inclusions called parallel tubular arrays (PTAs) and contains Fc-receptors for cytophilic antibody. In this study, lymphocytes containing PTAs (PTA-lymphocytes) were quantitated from serial peripheral blood specimens obtained from two patients with Epstein-Barr virus (EBV) mononucleosis and two patients with cytomegalovirus (CMV) mononucleosis. These data were then correlated with the clinical state of the patient. It was determined that both the percentage and absolute number of PTA-lymphocytes were highest during the acute phase of the illness. In follow-up specimens, three of the four patients' absolute lymphocyte count fell to within normal limits before the absolute PTA-lymphocyte count. In one patient, the absolute PTA-lymphocyte count was significantly elevated 13 months after the initial clinic visit. Although the PTA-lymphocyte count was highest during the acute phase of the illness, there was no consistent correlation with the clinical state of the patient during follow-up. The estimation of absolute PTA-lymphocyte counts was determined to be valid after a morphometric analysis of the cellular areas occupied by PTAs during the acute and convalescent phases of the disease revealed no statistical differences. Electron microscopy was also performed on the peripheral blood of a patient with syphilis. Although a hematologic workup of this patient during the acute phase of his illness revealed a large number of atypical lymphocytes, electron-microscopic examination of the same specimen revealed both a normal number and a normal percentage of PTA-lymphocytes. The immunologic role of this ultrastructurally distinct third population (non-T, non-B) of lymphocytes, or "killer cells," in the course of infectious mononucleosis is discussed. 

Payne, C. M.; Tennican, P. M.

1982-01-01

292

Final Report, Center for Programming Models for Scalable Parallel Computing: CoArray Fortran, Grant Number DE-FC02-01ER25505  

Microsoft Academic Search

The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for

Robert W. Numrich

2008-01-01

293

MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY  

SciTech Connect

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today.
The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 TeraOPS (operations per second). IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.

Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

2008-01-01
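
The EnLight peak-performance figure quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given there (256x256 matrix, 125 MHz clock); counting each multiply and each add as a separate operation is an assumption made here to arrive at the quoted 128K operations per cycle.

```python
# Peak-throughput arithmetic for a matrix-vector engine (values from the abstract).
MATRIX_ROWS = 256
MATRIX_COLS = 256
CLOCK_HZ = 125e6  # 125 MHz system clock

# One matrix-vector product needs one multiply and one add per matrix element.
# Assumption: a multiply and an add are counted as two separate operations.
macs_per_cycle = MATRIX_ROWS * MATRIX_COLS  # 65,536 multiply-accumulates
ops_per_cycle = 2 * macs_per_cycle          # 131,072 = 128K operations

peak_ops_per_second = ops_per_cycle * CLOCK_HZ
peak_tera_ops = peak_ops_per_second / 1e12

print(f"{ops_per_cycle // 1024}K operations per cycle")
print(f"{peak_tera_ops:.1f} TeraOPS peak")
```

Under that counting convention the product comes to roughly 16.4 TeraOPS, which is consistent with the rounded 16 TeraOPS stated in the record.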

294

Efficient design space exploration of high performance embedded out-of-order processors  

Microsoft Academic Search

Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies which

Stijn Eyerman; Lieven Eeckhout; Koen De Bosschere

2006-01-01

295

Throughput-Oriented Multicore Processors  

NASA Astrophysics Data System (ADS)

Many important commercial server applications are throughput-oriented. Chip multiprocessors (CMPs) are ideally suited to handle these workloads, as the multiple processors on the chip can independently service incoming requests. To date, most CMPs have been built using a small number of high-performance superscalar processor cores. However, the majority of commercial applications exhibit high cache miss rates, larger memory footprints, and low instruction-level parallelism, which leads to poor utilization on these CMPs. An alternative approach is to build a throughput-oriented, multithreaded CMP from a much larger number of simpler processor cores. This chapter explores the tradeoffs involved in building such a simple-core CMP. Two case studies, the Niagara and Niagara 2 CMPs from Sun Microsystems, are used to illustrate how simple-core CMPs are built in practice and how they compare to CMPs built from more traditional high-performance superscalar processor cores. The case studies show that simple-core CMPs can have a significant performance/watt advantage over complex-core CMPs.

Laudon, James; Golla, Robert; Grohoski, Greg

296

A VLSI design concept for parallel iterative algorithms  

NASA Astrophysics Data System (ADS)

Modern VLSI manufacturing technology has been shrinking rapidly toward the nanoscale. Integration with advanced nanotechnology now makes it possible to realize advanced parallel iterative algorithms directly in hardware, which was almost impossible 10 years ago. In this paper, we discuss the influence of evolving VLSI technologies on iterative algorithms and present design strategies from an algorithmic and architectural point of view. When an iterative algorithm is implemented on a multiprocessor array, there is a trade-off between the performance/complexity of the processors and the load/throughput of the interconnects. This is due to the behavior of iterative algorithms. For example, we could simplify the parallel implementation of the iterative algorithm (i.e., the processor elements of the multiprocessor array) in any way as long as convergence is guaranteed. However, modifying the algorithm (processors) usually increases the number of required iterations, which also means that the switching activity of the interconnects increases. As an example, we show that a 25×25 full Jacobi EVD array can be realized in a single FPGA device with the simplified μ-rotation CORDIC architecture.

Sun, C. C.; Götze, J.

2009-05-01
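
To make the iteration-count side of the trade-off described above concrete, here is a minimal sketch of the classical cyclic Jacobi eigenvalue method for a symmetric matrix. It uses exact rotation angles and plain Python lists; it is not the simplified CORDIC rotation of the paper, and the function name `jacobi_evd` and its parameters are illustrative choices made here. Replacing the exact angle with a cheaper approximate one would be expected to raise the number of sweeps reported.

```python
import math

def jacobi_evd(a, tol=1e-12, max_sweeps=50):
    """Return (sorted eigenvalues, sweeps used) for a symmetric matrix `a`."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input untouched
    for sweep in range(1, max_sweeps + 1):
        # Frobenius norm of the off-diagonal part measures remaining work.
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            return sorted(a[i][i] for i in range(n)), sweep - 1
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Exact angle that zeroes a[p][q] (and a[q][p]).
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                # A <- A J : update columns p and q.
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # A <- J^T A : update rows p and q.
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n)), max_sweeps

eigenvalues, sweeps = jacobi_evd([[2.0, 1.0], [1.0, 2.0]])
print(eigenvalues, sweeps)
```

In a processor array, each (p, q) rotation would map to one processing element, so a simpler rotation saves area per element at the cost of more sweeps and therefore more interconnect traffic.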

297

High performance parallel computers for science: New developments at the Fermilab advanced computer program  

SciTech Connect

Fermilab's Advanced Computer Program (ACP) has been developing highly cost-effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256-node, 5 GFlop system is under construction. 10 refs., 7 figs.

Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

1988-08-01

298

A 0.8-μm CMOS two-dimensional programmable mixed-signal focal-plane array processor with on-chip binary imaging and instructions storage

Microsoft Academic Search

This paper presents a CMOS chip for the parallel acquisition and concurrent analog processing of two-dimensional (2-D) binary images. Its processing function is determined by a reduced set of 19 analog coefficients whose values are programmable with 7-b accuracy. The internal programming signals are analog, but the external control interface is fully digital. On-chip nonlinear digital-to-analog converters (DAC's) map digitally

R. Dominguez-Castro; S. Espejo; A. Rodriguez-Vazquez; R. A. Carmona; P. Foldesy; A. Zarandy; P. Szolgay; T. Sziranyi; T. Roska

1997-01-01

299

A 0.8-μm CMOS Two-Dimensional Programmable Mixed-Signal Focal-Plane Array Processor with On-Chip Binary Imaging and Instructions Storage

Microsoft Academic Search

This paper presents a CMOS chip for the parallel acquisition and concurrent analog processing of two-dimensional (2-D) binary images. Its processing function is determined by a reduced set of 19 analog coefficients whose values are programmable with 7-b accuracy. The internal programming signals are analog, but the external control interface is fully digital. On-chip nonlinear digital-to-analog converters (DAC's)

Rafael Domínguez-Castro; Servando Espejo; Angel Rodríguez-Vazquez; Ricardo A. Carmona; Akos Zar; Tamas Roska

1997-01-01

300

Compact optical temporal processors  

NASA Astrophysics Data System (ADS)

Optical signal processing can be done with time-lens devices. A temporal processor based on chirp-z transformers is suggested. This configuration is more compact than a conventional 4-f temporal processor. On the basis of implementation aspects of such a temporal processor, we did a performance analysis. This analysis leads to the conclusion that an ultrafast optical temporal processor can be implemented.

Mendlovic, David; Melamed, Oded; Ozaktas, Haldun M.

1995-07-01

301

Abstract Families of Processors.  

National Technical Information Service (NTIS)

A 'processor' is a Turing-like automaton with auxiliary storage. An 'abstract family' of processors (AFP) consists of all processors that use the storage in the same way. Properties common to all AFP are derived. For a family of operations to be the outpu...

G. F. Rose

1968-01-01

302

Parallel Search On Video Cards  

Microsoft Academic Search

Recent approaches exploiting the massively parallel architecture of graphics processors (GPUs) to accelerate database operations have achieved intriguing results. While parallel sorting received significant attention, parallel search has not been explored. With p-ary search we present a novel parallel search algorithm for large-scale database index operations that scales with the number of processors and outperforms traditional thread-level

Tim Kaldewey; Jeff Hagen; Eric Sedlar
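
The abstract above names p-ary search without describing it. As a rough illustration of the general idea only (p workers split the current range into p segments and jointly narrow it to one segment per step), here is a sequential Python simulation. The function name `p_ary_search`, the half-open range convention, and the way workers are emulated by a loop are all assumptions made here, not details taken from the paper.

```python
def p_ary_search(sorted_keys, key, p=4):
    """Return an index of `key` in `sorted_keys`, or -1 if it is absent."""
    if p < 2:
        raise ValueError("p must be at least 2")
    lo, hi = 0, len(sorted_keys)  # search the half-open range [lo, hi)
    while hi - lo > p:
        step = (hi - lo + p - 1) // p  # segment length, rounded up
        # Default to the first segment; covers the case key < sorted_keys[lo].
        next_lo, next_hi = lo, min(lo + step, hi)
        # Each loop iteration stands in for one of p parallel workers.
        # Worker t inspects only the first key of segment t.
        for t in range(p):
            start = lo + t * step
            if start < hi and sorted_keys[start] <= key:
                next_lo, next_hi = start, min(start + step, hi)
        lo, hi = next_lo, next_hi
    # At most p keys remain: one equality comparison per worker.
    for i in range(lo, hi):
        if sorted_keys[i] == key:
            return i
    return -1

keys = list(range(0, 200, 2))
print(p_ary_search(keys, 42, p=4))
```

With p workers the range shrinks by a factor of about p per step, so the number of steps is on the order of log base p of n rather than log base 2 of n for binary search.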

303

Parallel programming interface for distributed data  

NASA Astrophysics Data System (ADS)

The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Catalogue identifier: AEEF_v1_0 Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 17,698 No. of bytes in distributed program, including test data, etc.: 166,173 Distribution format: tar.gz Programming language: Fortran, C Computer: Many parallel systems Operating system: Various Has the code been vectorised or parallelized?: Yes. 2-256 processors used RAM: 50 Mbytes Classification: 6.5 External routines: Global Arrays or MPI-2 Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access. Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Running time: Problem dependent. The test provided with the distribution takes only a few seconds to run.

Wang, Manhui; May, Andrew J.; Knowles, Peter J.

2009-12-01

304

Enhancing forced air convection heat transfer from an array of parallel plate fins using a heat pipe  

Microsoft Academic Search

An experimental study of heat transfer from an array of copper plate fins supported by a copper heat pipe and cooled by forced air flow is presented. The results are compared to an identical array of copper fins, but supported by a solid copper rod. The primary variable is the height of the fin stack, while the fin pitch, air

Z. Zhao; C. T. Avedisian

1997-01-01

305

Processor-Group Aware Runtime Support for Shared- and Global-Address Space Models

SciTech Connect

Exploiting multilevel parallelism using processor groups is becoming increasingly important for programming on high-end systems. This paper describes a group-aware run-time support for shared-/global-address space programming models. The current effort has been undertaken in the context of the Aggregate Remote Memory Copy Interface (ARMCI) [5], a portable runtime system used as a communication layer for Global Arrays [6], Co-Array Fortran (CAF) [9], GPSHMEM [10], Co-Array Python [11], and also end-user applications. The paper describes the management of shared memory, integration of shared memory communication and RDMA on clusters with SMP nodes, and registration. These are all required for efficient multi-method and multi-protocol communication on modern systems. Focus is placed on techniques for supporting process groups while maximizing communication performance and efficiently managing global memory system-wide.

Krishnan, Manoj Kumar; Tipparaju, Vinod; Palmer, Bruce; Nieplocha, Jarek

2004-12-07

306

Measurement of Fault-Tolerant Parallel Processors.  

National Technical Information Service (NTIS)

Computer systems that continue to operate correctly in the presence of faults are vital for many important applications. A number of measurement techniques can be used to determine how well computers detect and recover from faults. Both time to recover an...

J. W. Roberts; A. Mink; R. J. Carpenter

1987-01-01

307

Joint Experimentation on Scalable Parallel Processors  

Microsoft Academic Search

The JESPP project exemplifies the ready utility of High Performance computing for large-scale simulations. J9, the Joint Experimentation Program at the US Joint Forces Command, is tasked with ensuring that the United States' armed forces benefit from improvements in doctrine, interoperability, and integration. In order to simulate the future battlespace, J9 must expand the capabilities of its JSAF code along

Robert F. Lucas; Dan M. Davis

2003-01-01

308

GPU Computing: Programming a Massively Parallel Processor  

Microsoft Academic Search

Summary form only given. Many researchers have observed that general purpose computing with programmable graphics hardware (GPUs) has shown promise to solve many of the world's compute intensive problems, many orders of magnitude faster than conventional CPUs. The challenge has been working within the constraints of a graphics programming environment and limited language support to leverage this huge performance potential.

Ian Buck

2007-01-01

309

SUDS: Automatic Parallelization for Raw Processors  

Microsoft Academic Search

A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind of system built by humans. A computer system's throughput, however, is constrained by that system's ability to find concurrency. Given a particular target

Matthew Ian Frank

2003-01-01

310

Transitive closure on the imagine stream processor  

SciTech Connect

The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best-suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that limited performance of TC is achieved primarily due to the complicated data-dependencies of the blocked algorithm. This work is an ongoing effort to identify classes of scientific problems well-suited for streaming processors.

Griem, Gorden; Oliker, Leonid

2003-11-11
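
For readers unfamiliar with the Transitive Closure problem mentioned above, the following is a minimal sketch of the plain (untiled) Warshall algorithm on a boolean adjacency matrix. It is not the tiled streaming version developed for Imagine; the function name `transitive_closure` and the list-of-lists representation are choices made here for illustration.

```python
def transitive_closure(adj):
    """Return the reachability matrix of a directed graph.

    adj[i][j] is True when there is an edge i -> j. The input is not modified.
    """
    n = len(adj)
    reach = [row[:] for row in adj]
    # After step k, reach[i][j] is True iff j is reachable from i using
    # only intermediate vertices drawn from {0, ..., k}.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

# Chain 0 -> 1 -> 2: vertex 2 becomes reachable from vertex 0.
chain = [
    [False, True, False],
    [False, False, True],
    [False, False, False],
]
closure = transitive_closure(chain)
print(closure[0][2], closure[2][0])
```

Every step k reads row k and column k of the matrix updated in the previous step, which is the kind of data dependency the abstract identifies as limiting streaming performance.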

311

Parallel I/O Systems  

NSDL National Science Digital Library

* Redundant disk array architectures
* Fault tolerance issues in parallel I/O systems
* Caching and prefetching
* Parallel file systems
* Parallel I/O systems
* Parallel I/O programming paradigms
* Parallel I/O applications and environments
* Parallel programming with parallel I/O

Apon, Amy

312

A low-power multi-core media co-processor for mobile application processors  

Microsoft Academic Search

A multi-core co-processor for mobile application processors is introduced. It provides low-power, high-throughput, fully software-based acceleration of multimedia processing. The test chip fabricated in a 65 nm CMOS technology consumes 620 mW in H.264 720p 60 fps decoding and 9.7 mW in MPEG-4 AAC decoding. In the maximum workload of H.264 decoding, a symmetrical parallelization achieves 7.5 times performance enhancement by

Shuou Nomura; Fumihiko Tachibana; Tetsuya Fujita; Chen Kong Teh; Hiroyuki Usui; Fumiyuki Yamane; Yukimasa Miyamoto; Takahiro Yamashita; Hiroyuki Hara; Mototsugu Hamada; Yoshiro Tsuboi

2009-01-01

313

Sandia secure processor : a native Java processor.  

SciTech Connect

The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP's design is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

2003-08-01

314

FPGA Processor Implementation for the Forward Kinematics of the UMDH.  

National Technical Information Service (NTIS)

The focus of this research was on the implementation of a forward kinematic algorithm for the Utah MIT Dexterous Hand (UMDH). Specifically, the algorithm was synthesized from mathematical models onto a Field Programmable Gate Array (FPGA) processor. The c...

S. M. Parmley

1997-01-01

315

Frequency Compression and Expansion Using an Electrooptical Processor.

National Technical Information Service (NTIS)

A method of compressing or expanding the frequency of signals while keeping the signals' original gross temporal relationship relies upon an electrooptical processor. An apertured mask is interposed between an area-array charge coupled device (CCD) and a ...

K. Bromley

1979-01-01

316

Evaluation of Parallel Domain Decomposition Algorithms  

Microsoft Academic Search

In this talk we describe and evaluate several heuristics for the parallel decomposition of data into n processors. This problem is usually cast as a graph partitioning problem, requiring that each processor have equal amount of data, and that inter-processor communication be minimized. Since this problem is NP-complete, several heuristics have been developed: combinatorial, geometric, spectral, multilevel,

Izaguirre Paz
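
The two objectives named in the abstract above (equal data per processor, minimal inter-processor communication) are commonly scored as load imbalance and edge cut. The sketch below computes both for a given partition; it evaluates a partition rather than implementing any of the listed heuristics, and the function name `partition_quality` and its argument layout are assumptions made here.

```python
from collections import Counter

def partition_quality(edges, part, n_procs):
    """Score a partition of a graph across processors.

    edges:   iterable of (u, v) vertex pairs
    part:    dict mapping each vertex to a processor id in range(n_procs)
    Returns (edge_cut, imbalance), where imbalance is the heaviest
    processor's load divided by the ideal load (1.0 is perfectly balanced).
    """
    # Edges whose endpoints live on different processors need communication.
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    loads = Counter(part.values())
    ideal_load = len(part) / n_procs
    imbalance = max(loads.values()) / ideal_load
    return edge_cut, imbalance

# A 4-cycle split two ways across 2 processors.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
contiguous = {0: 0, 1: 0, 2: 1, 3: 1}
alternating = {0: 0, 1: 1, 2: 0, 3: 1}
print(partition_quality(ring, contiguous, 2))
print(partition_quality(ring, alternating, 2))
```

Both partitions are perfectly balanced, but the contiguous one cuts two edges while the alternating one cuts all four, which is the distinction a partitioning heuristic tries to optimize.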

317

Issues and challenges in compiling for graphics processors  

Microsoft Academic Search

Graphics has been one of the best success stories of parallel processing. Using a unique combination of specialized hardware and a specialized programming model, game developers routinely write high performance code using millions of threads. Each generation of graphics processors (GPUs) delivers higher performance and is more programmable than the last. Unlike CPUs, these processors are designed from the beginning to

Norm Rubin

2008-01-01

318

Customization of application specific heterogeneous multi-pipeline processors  

Microsoft Academic Search

In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the ap-

Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

2006-01-01

319

Context-Switching Techniques for Decoupled Multithreaded Processors  

Microsoft Academic Search

Multithreading techniques use coarse grain parallelism to speed up computation of a multithreaded workload by better utilization of the resources of a single processor. The paper surveys context switching techniques for multithreaded single-issue processors and classifies the techniques according to the events that trigger a context switch. We survey static and dynamic block interleaving techniques and demonstrate the application of

Jochen Kreuzinger; Theo Ungerer

1999-01-01

320

Broadcasting collective operation contributions throughout a parallel computer  

SciTech Connect

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

Faraj, Ahmad (Rochester, MN)

2012-02-21

321

Reconfigurable computer array: The bridge between high speed sensors and low speed computing  

SciTech Connect

A universal limitation of RF and imaging front-end sensors is that they easily produce data at a higher rate than any general-purpose computer can continuously handle. Therefore, Los Alamos National Laboratory has developed a custom Reconfigurable Computing Array board to support a large variety of processing applications including wideband RF signals, LIDAR and multi-dimensional imaging. The board's design exploits three key features to achieve its performance. First, there are large banks of fast memory dedicated to each reconfigurable processor and also shared between pairs of processors. Second, there are dedicated data paths between processors, and from a processor to flexible I/O interfaces. Third, the design provides the ability to link multiple boards into a serial and/or parallel structure.

Robinson, S.H.; Caffrey, M.P.; Dunham, M.E.

1998-06-16

322

Doppler-free, multiwavelength acousto-optic deflector for two-photon addressing arrays of Rb atoms in a quantum information processor  

NASA Astrophysics Data System (ADS)

We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb87 atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 and 480 nm) and have nonoverlapping Bragg-matched frequency response at these wavelengths, so that there will be no cross talk when proportional frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated band shapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 and 480 nm), and a 4 μs or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD.

Kim, Sangtaek; McLeod, Robert R.; Saffman, M.; Wagner, Kelvin H.

2008-04-01

323

Graphics Processor Based Implementation of Bioinformatics Codes  

Microsoft Academic Search

We created a powerful computing platform based on video cards with the goal of accelerating the performance of bioinformatics codes. To satisfy the demands of the video gaming industry, modern graphics processing units (GPUs) have become very advanced computational devices, using a large set of stream processors to render multiple pixels in parallel. Recently, computer scientists have taken interest in

Andrew Bellenir; Christian Trefftz; Greg Wolffe

2008-01-01

324

Thermal solutions to Pentium processors in TCP in notebooks and sub-notebooks  

Microsoft Academic Search

Less than one year after the introduction of the 90 MHz Pentium Processor in Pin Grid Array (PGA) package, 75 MHz Pentium Processor in Tape Carrier Package (TCP) has been introduced for applications in mobile products. Notebooks and sub-notebooks using the 75 MHz Pentium Processor in TCP are expected to be on the market early 1995. In this paper, we

H. Xie; M. Aghazadeh; W. Lui; K. Haley

1995-01-01

325

Scioto: A Framework for Global-View Task Parallelism

SciTech Connect

We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.

Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

2008-09-09

326

Systolic Array Adaptive Beamforming.  

National Technical Information Service (NTIS)

A computing architecture which reflects the specific requirements of an optimum adaptive space-time array processor is discussed. Specifically, a frequency domain implementation of the minimum variance distortionless response (MVDR) beamformer is describe...

N. L. Owsley

1987-01-01
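
The abstract above names the minimum variance distortionless response (MVDR) beamformer without stating it. For orientation, the standard textbook form is given below. The notation is chosen here and is not taken from the report: R(f) is the cross-spectral density matrix of the sensor outputs at frequency f, d(f, θ) is the steering vector toward look direction θ, and w(f) is the weight vector.

```latex
\min_{\mathbf{w}} \; \mathbf{w}^{H}\mathbf{R}(f)\,\mathbf{w}
\quad \text{subject to} \quad \mathbf{w}^{H}\mathbf{d}(f,\theta) = 1
\qquad \Longrightarrow \qquad
\mathbf{w}(f) =
\frac{\mathbf{R}^{-1}(f)\,\mathbf{d}(f,\theta)}
     {\mathbf{d}^{H}(f,\theta)\,\mathbf{R}^{-1}(f)\,\mathbf{d}(f,\theta)}
```

The constraint keeps unit gain in the look direction (the "distortionless" part), while the minimization suppresses power arriving from all other directions. How the report maps this computation onto a systolic array is not recoverable from the truncated abstract.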

327

Processor equivalence for daisy chain load sharing processors  

Microsoft Academic Search

A linear daisy chain of processors in which processor load is divisible and shared among the processors is examined. It is shown that two or more processors can be collapsed into a single equivalent processor. This equivalence allows a characterization of the nature of the minimal time solution, a simple method to determine when to distribute load for linear daisy

Thomas G. Robertazzi

1993-01-01

328

Multilist Scheduling. A New Parallel Programming Model.  

National Technical Information Service (NTIS)

Parallel programming requires task scheduling to optimize performance; this primarily involves balancing the load over the processors. In many cases, it is critical to perform task scheduling at runtime. For example, (1) in many parallel applications the ...

I. C. Wu; H. T. Kung; P. Steenkiste; D. O'Hallaron; G. Thompson

1993-01-01

329

Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System  

Microsoft Academic Search

Queueing models for a simple heterogeneous multiple processor system are presented, analyzed, and compared. Each model is distinguished by a job routing strategy which is designed to reduce the average job turnaround time by balancing the total load among the processors. In each case an arriving job is routed by a job dispatcher to one of m parallel processors. The

Yuan-chieh Chow; Walter H. Kohler

1979-01-01

330

Quantum imager and processor for guided interceptor applications  

NASA Astrophysics Data System (ADS)

Full integration of large sensor and processor arrays will eventually require the development of logic devices and circuits that can be scaled down to nanometer geometries. This level of integration can only be obtained by a revolutionary approach to avoiding the scaling limits of conventional integrated circuits (ICs). Nanoelectronics is an emerging IC technology that will exploit quantum electronic effects in nanometer-sized heterojunction devices to enable single chip image processing systems consisting of millions of sensor-processor elements. In a quantum image processor, each sensor will be integrated with a high performance signal processor to provide a chip-level throughput that approaches 1 million TeraOps. This paper discusses the thermodynamic and other physical limits to developing monolithic, quantum sensor processors and summarizes the processor architecture, device and circuit concepts that are in development.

Frazier, Gary; McCarley, Paul

1993-06-01

331

Pentium® III Processor Implementation Tradeoffs  

Microsoft Academic Search

This paper discusses the implementation tradeoffs of the Pentium® III processor. The Pentium III processor implements a new extension of the IA-32 instruction set called the Internet Streaming Single-Instruction, Multiple-Data (SIMD) Extensions (Internet SSE). The processor is based on the Pentium® Pro processor microarchitecture. The initial development goals for the Pentium III processor were to balance performance, cost, and

Jagannath Keshava; Vladimir Pentkovski

332

Challenge of Massively Parallel Computing.  

National Technical Information Service (NTIS)

Since the mid-1980's, there have been a number of commercially available parallel computers with hundreds or thousands of processors. These machines have provided a new capability to the scientific community, and they have been used successfully by scientists ...

D. E. Womble

1999-01-01

333

Performance Model for Massive Parallelism.  

National Technical Information Service (NTIS)

A popular argument is that vector and parallel architectures should not be carried to extremes because the scalar or serial portion of the code will eventually dominate. Since pipeline stages and extra processors obviously add hardware cost, a corollary t...

J. L. Gustafson

1988-01-01

334

High Density 3-D Integration Technology for Massively Parallel Signal Processing in Advanced Infrared Focal Plane Array Sensors  

Microsoft Academic Search

The paper describes a platform technology for three-dimensional (3-D) integration of multiple layers of silicon integrated circuits. The technology promises to dramatically enhance on-chip signal processing capabilities of a variety of sensor and actuator devices hybridized with Si electronics. Among these applications are high performance infrared focal plane array detectors

D. Temple; C. A. Bower; D. Malta; J. E. Robinson; P. R. Coffman; M. R. Skokan; T. B. Welch

2006-01-01

335

Design and microfabrication of a high-aspect-ratio PDMS microbeam array for parallel nanonewton force measurement and protein printing  

NASA Astrophysics Data System (ADS)

Cell and protein mechanics has applications ranging from cellular development to tissue engineering. Techniques such as magnetic tweezers, optic tweezers and atomic force microscopy have been used to measure cell deformation forces of the order of piconewtons to nanonewtons. In this study, an array of polymeric polydimethylsiloxane (PDMS) microbeams with diameters of 10-40 µm and lengths of 118 µm was fabricated from Sylgard® with curing agent concentrations ranging from 5% to 20%. The resulting spring constants were 100-300 nN µm⁻¹. The elastic modulus of PDMS was determined experimentally at different curing agent concentrations and found to be 346 kPa to 704 kPa in a millimeter-scale array and ~1 MPa in a microbeam array. Additionally, the microbeam array was used to print laminin for the purpose of cell adhesion. Linear and nonlinear finite element analyses are presented and compared to the closed-form solution. The highly compliant, transparent, biocompatible PDMS may offer a method for more rapid throughput in cell and protein mechanics force measurement experiments with sensitivities necessary for highly compliant structures such as axons.

Sasoglu, F. M.; Bohl, A. J.; Layton, B. E.

2007-03-01

336

Design and microfabrication of a high-aspect-ratio PDMS microbeam array for parallel nanonewton force measurement and protein printing  

Microsoft Academic Search

Cell and protein mechanics has applications ranging from cellular development to tissue engineering. Techniques such as magnetic tweezers, optic tweezers and atomic force microscopy have been used to measure cell deformation forces of the order of piconewtons to nanonewtons. In this study, an array of polymeric polydimethylsiloxane (PDMS) microbeams with diameters of 10–40 µm and lengths of 118 µm was

F Mert Sasoglu; Andrew J Bohl; Bradley E Layton

2007-01-01

337

Performance analysis of parallel processing systems  

Microsoft Academic Search

A centralized parallel processing system with job splitting is considered. In such a system, jobs wait in a central queue, which is accessible by all the processors, and are split into independent tasks that can be executed on separate processors. This parallel processing system is modeled as a bulk arrival M^X/M/c queueing system where customers and bulks correspond to tasks

R. Nelson; D. Towsley; A. N. Tantawi

1987-01-01

338

Pthreads for dynamic and irregular parallelism  

Microsoft Academic Search

High performance applications on shared memory machines have typically been written in a coarse grained style, with one heavyweight thread per processor. In comparison, programming with a large number of lightweight, parallel threads has several advantages, including simpler coding for programs with irregular and dynamic parallelism, and better adaptability to a changing number of processors. The programmer can express a

Girija J. Narlikar; Guy E. Blelloch

1998-01-01

339

Parallel Algorithms on the ASTRA SIMD Machine  

Microsoft Academic Search

In view of the tremendous computing power jump of modern RISC processors the interest in parallel computing seems to be thinning out. Why use a complicated system of parallel processors, if the problem can be solved by a single powerful micro-chip? It is a general law, however, that exponential growth will always end by some kind of a saturation, and

G. Odor; F. Rohrbach; G. Vesztergombi; G. Varga; F. Tatrai

1995-01-01

340

Parallel Computer Modeling of Complex Electromagnetic Systems  

Microsoft Academic Search

The HFSS frequency domain computer code, installed on an AIX-3 computer, was used to model a geometrically complex electromagnetic system operating over many decades of frequencies in nonhomogeneous media. First, we used a single processor unit, and then 16 processors operating in parallel. Since the Maxwell equations used in the HFSS code show no time dependence, parallel processing can be

A. S. Podgorski; Marek B. Zaremba; M. Vogel

2000-01-01

341

Virtual Reality and Parallel Systems Performance Analysis  

Microsoft Academic Search

Recording and analyzing the dynamics of application program, system software, and hardware interactions are the keys to understanding and tuning the performance of massively parallel systems. Because massively parallel systems contain hundreds or thousands of processors, each potentially with many dynamic performance metrics, the performance data occupy a sparsely populated, high-dimensional space. These dynamic performance metrics for each processor define

Daniel A. Reed; Keith A. Shields; Will H. Scullin; Luis F. Tawera; Christopher L. Elford

1995-01-01

342

SHAKE parallelization  

PubMed Central

SHAKE is a widely used algorithm to impose general holonomic constraints during molecular simulations. By imposing constraints on stiff degrees of freedom that require integration with small time steps (without the constraints) we are able to calculate trajectories with time steps larger by approximately a factor of two. The larger time step makes it possible to run longer simulations. Another approach to extend the scope of Molecular Dynamics is parallelization. Parallelization speeds up the calculation of the forces between the atoms and makes it possible to compute longer trajectories with better statistics for thermodynamic and kinetic averages. A combination of SHAKE and parallelism is therefore highly desired. Unfortunately, the most widely used SHAKE algorithm (of bond relaxation) is inappropriate for parallelization and alternatives are needed. The alternatives must minimize communication, lead to good load balancing, and offer significantly better performance than the bond relaxation approach. The algorithm should also scale with the number of processors. We describe the theory behind different implementations of constrained dynamics on parallel systems, and their implementation on common architectures.

Elber, Ron; Ruymgaart, A. Peter; Hess, Berk

2011-01-01

343

Beta Operations: Efficient Implementation of a Primitive Parallel Operation.  

National Technical Information Service (NTIS)

The ever decreasing cost of computer processors has created a great interest in multi-processor computers. However, along with the increased power that this parallelism brings, comes increased complexity in programming. One approach to lessening this comp...

E. R. Cohn R. W. Haddad

1986-01-01

344

Massively parallel computing system  

DOEpatents

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

Benner, R.E.; Gustafson, J.L.; Montry, G.R.

1989-03-01

345

Efficient execution of Kahn process networks on multi-processor systems using protothreads and windowed FIFOs  

Microsoft Academic Search

As single-processor systems are ceasing to scale effectively, multi-processor systems are becoming more and more popular. While there are many challenges of designing multi-processor systems in hardware, writing efficient parallel applications that utilize the computing capability of multiple processors may prove even more challenging. In this paper, we introduce a framework that allows efficient execution of applications expressed

Wolfgang Haid; Lars Schor; Kai Huang; Iuliana Bacivarov; Lothar Thiele

2009-01-01

346

Combining Task and Data Parallelism to Speed up Protein Folding on a Desktop Grid Platform: Is efficient protein folding possible with CHARMM on the United Devices MetaProcessor?  

Microsoft Academic Search

The steady increase of computing power at lower and lower cost enables molecular dynamics simulations to investigate the process of protein folding with an explicit treatment of water molecules. Such simulations are typically done with well known computational chemistry codes like CHARMM. Desktop grids such as the United Devices MetaProcessor are highly attractive platforms, since scavenging for unused machines on

M. Taufer; T. Stricker; G. Settanni; A. Cavalli; A. Caflisch

347

Evaluating Local Indirect Addressing in SIMD (Single Instruction Stream, Multiple Data Stream) processors.  

National Technical Information Service (NTIS)

In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The a...

D. Middleton S. Tomboulian

1989-01-01

348

Gang scheduling a parallel machine  

SciTech Connect

Program development on parallel machines can be a nightmare of scheduling headaches. We have developed a portable time sharing mechanism to handle the problem of scheduling gangs of processors. User programs and their gangs of processors are put to sleep and awakened by the gang scheduler to provide a time sharing environment. Time quantums are adjusted according to priority queues and a system of fair share accounting. The initial platform for this software is the 128 processor BBN TC2000 in use in the Massively Parallel Computing Initiative at the Lawrence Livermore National Laboratory. 2 refs., 1 fig.

Gorda, B.C.; Brooks, E.D. III.

1991-03-01

349

20-GFLOPS QR processor on a Xilinx Virtex-E FPGA  

NASA Astrophysics Data System (ADS)

Adaptive beamforming can play an important role in sensor array systems in countering directional interference. In high-sample rate systems, such as radar and comms, the calculation of adaptive weights is a very computationally demanding task that requires highly parallel solutions. For systems where low power consumption and volume are important the only viable implementation is as an Application Specific Integrated Circuit (ASIC). However, the rapid advancement of Field Programmable Gate Array (FPGA) technology is enabling highly credible re-programmable solutions. In this paper we present the implementation of a scalable linear array processor for weight calculation using QR decomposition. We employ floating-point arithmetic with mantissa size optimized to the target application to minimize component size, and implement them as relationally placed macros (RPMs) on Xilinx Virtex FPGAs to achieve predictable dense layout and high-speed operation. We present results that show that 20GFLOPS of sustained computation on a single XCV3200E-8 Virtex-E FPGA is possible. We also describe the parameterized implementation of the floating-point operators and QR-processor, and the design methodology that enables us to rapidly generate complex FPGA implementations using the industry standard hardware description language VHDL.

Walke, Richard L.; Smith, Robert W.; Lightbody, Gaye

2000-11-01

350

Mapping of neural networks onto the memory-processor integrated architecture.  

PubMed

In this paper, an effective memory-processor integrated architecture, called memory-based processor array for artificial neural networks (MPAA), is proposed. The MPAA can be easily integrated into any host system via memory interface. Specifically, the MPAA system provides an efficient mechanism for its local memory accesses allowed by row and column bases, using hybrid row and column decoding, which is suitable for computation models of ANNs such as the accessing and alignment patterns given for matrix-by-vector operations. Mapping algorithms to implement the multilayer perceptron with backpropagation learning on the MPAA system are also provided. The proposed algorithms support both neuron and layer level parallelisms which allow the MPAA system to operate the learning phase as well as the recall phase in the pipelined fashion. Performance evaluation is provided by detailed comparison in terms of two metrics such as the cost and number of computation steps. The results show that the performance of the proposed architecture and algorithms is superior to those of the previous approaches, such as one-dimensional single-instruction multiple data (SIMD) arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube machines. PMID:12662777

Kim, Youngsik; Noh, Mi Jung; Han, Tack Don; Kim, Shin Dug

1998-08-01

351

A Sub 100mW H.264 MP@L4.1 Integer-Pel Motion Estimation Processor Core for MBAFF Encoding with Reconfigurable Ring-Connected Systolic Array and Segmentation-Free, Rectangle-Access Search-Window Buffer  

NASA Astrophysics Data System (ADS)

We describe a sub 100-mW H.264 MP@L4.1 integer-pel motion estimation processor core for low-power video encoders. It supports macro block adaptive frame field (MBAFF) encoding and bidirectional prediction for a resolution of 1920×1080 pixels at 30fps. The proposed processor features a novel hierarchical algorithm, reconfigurable ring-connected systolic array architecture and segmentation-free, rectangle-access search window buffer. The hierarchical algorithm consists of a fine search and a coarse search. A complementary recursive cross search is newly introduced in the coarse search. The fine search is adaptively carried out, based on an image analysis result obtained by the coarse search. The proposed systolic array architecture minimizes the amount of transferred data, and lowers computation cycles for the coarse and fine searches. In addition, we propose a novel search window buffer SRAM that has instantaneous accessibility to a rectangular area with arbitrary location. The processor core has been designed with a 90nm CMOS design rule. Core size is 2.5×2.5 mm². One core supports one-reference-frame search and dissipates 48mW at 1V. A two-core configuration consumes 96mW for two-reference-frame search.

Murachi, Yuichiro; Miyakoshi, Junichi; Hamamoto, Masaki; Iinuma, Takahiro; Ishihara, Tomokazu; Yin, Fang; Lee, Jangchung; Kawaguchi, Hiroshi; Yoshimoto, Masahiko

352

Parallel test description and analysis of parallel test system speedup through Amdahl's law  

Microsoft Academic Search

This paper will outline various types of parallel test, discuss an adaptation of Amdahl's law to parallel test, and discuss possible extensions to ATML for parallel test. Amdahl's law is an equation in computer science that is used to derive the speedup gained through parallelizing the software; it expresses the speedup as a function of number of processors. Parallel test

Nathan Waivio; Rolling Meadows

2007-01-01
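
The abstract above describes Amdahl's law as giving speedup as a function of the number of processors. A minimal sketch of the standard textbook form of the law follows; the function name `amdahl_speedup` and the parameter `parallel_fraction` are illustrative choices, and the paper's specific adaptation to parallel test is not reproduced here.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by the textbook form of Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0..1).
    n_processors: number of processors applied to the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial share is unaffected; the parallel share shrinks by 1/N.
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Illustrative workload that is assumed to be 90% parallelizable.
    for n in (1, 2, 4, 8, 16):
        print(n, amdahl_speedup(0.9, n))
```

As the processor count grows, the result approaches `1 / serial_fraction`, which is the ceiling the law places on speedup.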

353

Programmable systolic trigger processor for FERA bus data.  

National Technical Information Service (NTIS)

A generic CAMAC based trigger processor module for fast processing of large amounts of ADC data, has been designed. This module has been realised using complex programmable gate arrays (LCAs from XILINX). The gate arrays have been connected to memories an...

G. Appelquist B. Hovander B. Sellden C. Bohm

1992-01-01

354

Iterative color-multiplexed, electro-optical processor  

Microsoft Academic Search

A noncoherent optical vector-matrix multiplier using a linear LED source array and a linear P-I-N photodiode detector array has been combined with a 1-D adder in a feedback loop. The resultant iterative optical processor and its use in solving simultaneous linear equations are described. Operation on complex data is provided by a novel color-multiplexing system.

Demetri Psaltis; David Casasent; Mark Carlotto

1979-01-01
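
The abstract above describes a vector-matrix multiplier combined with an adder in a feedback loop to solve simultaneous linear equations. One iteration that fits that description is the Richardson scheme, x ← x + ω(b − Ax). The sketch below assumes this scheme purely for illustration; the record does not state which iteration the authors used, and `richardson_solve` and `relaxation` are hypothetical names.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def richardson_solve(matrix, rhs, relaxation=0.1, iterations=500):
    """Approximate the solution of matrix @ x = rhs by Richardson iteration.

    Each pass performs one matrix-vector product and one vector addition.
    Convergence requires the spectral radius of (I - relaxation * matrix)
    to be below 1, which the caller must ensure.
    """
    x = [0.0] * len(rhs)
    for _ in range(iterations):
        product = mat_vec(matrix, x)
        x = [xi + relaxation * (bi - pi) for xi, bi, pi in zip(x, rhs, product)]
    return x


if __name__ == "__main__":
    # Small symmetric positive definite system chosen for illustration.
    print(richardson_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```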

355

CUBA: an architecture for efficient CPU/co-processor data communication  

Microsoft Academic Search

Data-parallel co-processors have the potential to improve performance in highly parallel regions of code when coupled to a general-purpose CPU. However, applications often have to be modified in non-intuitive and complicated ways to mitigate the cost of data marshalling between the CPU and the co-processor. In some applications the overheads cannot be amortized and co-processors are unable to provide

Isaac Gelado; John H. Kelm; Shane Ryoo; Steven S. Lumetta; Nacho Navarro; Wen-mei W. Hwu

2008-01-01

356

Adaptive CFAR PI Processor for Radar Target Detection in Pulse Jamming  

Microsoft Academic Search

A new parallel algorithm for signal processing and a parallel systolic architecture of a CFAR processor with adaptive post detection integration (API) are presented in this paper. The processor proposed is used for effective target detection in a single range resolution cell of a radar when echoes from small airborne targets are performed in conditions of pulse jamming. The main

Vera P. Behar; Christo A. Kabakchiev; Lyubka A. Doukovska

2000-01-01

357

Allergen arrays for antibody screening and immune cell activation profiling generated by parallel lipid dip-pen nanolithography.  

PubMed

Multiple-allergen testing for high throughput and high sensitivity requires the development of miniaturized immunoassays that allow for a large test area and require only a small volume of the test analyte, which is often available only in limited amounts. Developing such miniaturized biochips containing arrays of test allergens needs application of a technique able to deposit molecules at high resolution and speed while preserving their functionality. Lipid dip-pen nanolithography (L-DPN) is an ideal technique to create such biologically active surfaces, and it has already been successfully applied for the direct, nanoscale deposition of functional proteins, as well as for the fabrication of biochemical templates for selective adsorption. The work presented here shows the application of L-DPN for the generation of arrays of the ligand 2,4-dinitrophenyl[1,2-dipalmitoyl-sn-glycero-3-phosphoethanolamine-N-[6-[(2,4-dinitrophenyl)amino]hexanoyl] (DNP)] onto glass surfaces as a model system for detection of allergen-specific Immunoglobulin E (IgE) antibodies and for mast cell activation profiling. PMID:22278752

Sekula-Neuner, Sylwia; Maier, Jana; Oppong, Emmanuel; Cato, Andrew C B; Hirtz, Michael; Fuchs, Harald

2012-01-26

358

Efficient Algorithms for Reconfiguration in VLSI/WSI Arrays  

Microsoft Academic Search

The issue of developing efficient algorithms for reconfiguring processor arrays in the presence of faulty processors and fixed hardware resources is discussed. The models discussed consist of a set of identical processors embedded in a flexible interconnection structure that is configured in the form of a rectangular grid. An array grid model based on single-track switches is considered. An efficient

Vwani P. Roychowdhury; Jehoshua Bruck; Thomas Kailath

1990-01-01

359

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.

Karasick, Michael S. (Ridgefield, CT); Strip, David R. (Albuquerque, NM)

1996-01-01

360

Beyond Processor Sharing.  

National Technical Information Service (NTIS)

While the (Egalitarian) Processor-Sharing (PS) discipline offers crucial insights in the performance of fair resource allocation mechanisms, it is inherently limited in analyzing and designing differentiated scheduling algorithms such as Weighted Fair Que...

R. Nunez Queija S. Aalto S. C. Borst U. Ayesta V. Misra

2007-01-01

361

3081/E processor  

SciTech Connect

The 3081/E project was formed to prepare a much improved IBM mainframe emulator for the future. Its design is based on a large amount of experience in using the 168/E processor to increase available CPU power in both online and offline environments. The processor will be at least equal to the execution speed of a 370/168 and up to 1.5 times faster for heavy floating point code. A single processor will thus be at least four times more powerful than the VAX 11/780, and five processors on a system would equal at least the performance of the IBM 3081K. With its large memory space and simple but flexible high speed interface, the 3081/E is well suited for the online and offline needs of high energy physics in the future.

Kunz, P.F.; Gravina, M.; Oxoby, G.; Rankin, P.; Trang, Q.; Ferran, P.M.; Fucci, A.; Hinton, R.; Jacobs, D.; Martin, B.

1984-04-01

362

Adaptive signal processor  

SciTech Connect

An experimental, general purpose adaptive signal processor system has been developed, utilizing a quantized (clipped) version of the Widrow-Hoff least-mean-square adaptive algorithm developed by Moschner. The system accommodates 64 adaptive weight channels with 8-bit resolution for each weight. Internal weight update arithmetic is performed with 16-bit resolution, and the system error signal is measured with 12-bit resolution. An adapt cycle of adjusting all 64 weight channels is accomplished in 8 µsec. Hardware of the signal processor utilizes primarily Schottky-TTL type integrated circuits. A prototype system with 24 weight channels has been constructed and tested. This report presents details of the system design and describes basic experiments performed with the prototype signal processor. Finally some system configurations and applications for this adaptive signal processor are discussed.

Walz, H.V.

1980-07-01
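
The abstract above refers to a clipped version of the Widrow-Hoff least-mean-square (LMS) algorithm. A minimal sketch of one common clipped variant follows, in which each weight update uses only the sign of the corresponding input sample. Whether this matches the formulation used in the report is an assumption, and the sketch does not model the 8-, 12- and 16-bit quantization the abstract mentions. The names `clipped_lms_step` and `step_size` are illustrative.

```python
def sign(value):
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def clipped_lms_step(weights, inputs, desired, step_size):
    """Perform one clipped-data LMS update (assumed variant).

    Standard LMS updates each weight by step_size * error * input.
    The clipped form replaces the input by its sign, which removes a
    multiplication per weight channel.
    """
    output = sum(w * x for w, x in zip(weights, inputs))
    error = desired - output
    new_weights = [w + step_size * error * sign(x)
                   for w, x in zip(weights, inputs)]
    return new_weights, error


if __name__ == "__main__":
    weights = [0.0, 0.0]
    # Illustrative single update with made-up sample values.
    weights, error = clipped_lms_step(weights, [2.0, -3.0], 1.0, 0.5)
    print(weights, error)
```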

363

Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications  

Microsoft Academic Search

Describes COMET (concurrency measurement tool), a software tool for measuring parallelism in large scientific/engineering applications. The proposed tool measures the total parallelism present in programs, filtering out the effects of communication/synchronization delays, finite storage, limited number of processors, the policies for management of processors and storage, etc. Although an ideal machine that can exploit the total parallelism is not realizable,

Manoj Kumar

1988-01-01

364

On Building a Kohonen Neural Net Parallel Simulator  

Microsoft Academic Search

This paper presents a Kohonen neural net parallel simulator. The simulator was developed on a Sequent Balance 8000 computer system. Comparative results emphasize the impact of the different strategies of parallelization, the number of processors involved and inter-processor communications over the efficiency of the parallel implementation. The simulator was used in a pattern recognition application, in the automatic synthesis of

C. V. Buhusi; David J. Evans

1994-01-01

365

Buffered coscheduling for parallel programming and enhanced fault tolerance  

DOEpatents

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.

Petrini, Fabrizio (Los Alamos, NM); Feng, Wu-chun (Los Alamos, NM)

2006-01-31

366

Stochastic propagation of an array of parallel cracks: Exploratory work on matrix fatigue damage in composite laminates  

SciTech Connect

Transverse cracking of polymeric matrix materials is an important fatigue damage mechanism in continuous-fiber composite laminates. The propagation of an array of these cracks is a stochastic problem usually treated by Monte Carlo methods. However, this exploratory work proposes an alternative approach wherein the Monte Carlo method is replaced by a more closed-form recursion relation based on "fractional Brownian motion." A fractal scaling equation is also proposed as a substitute for the more empirical Paris equation describing individual crack growth in this approach. Preliminary calculations indicate that the new recursion relation is capable of reproducing the primary features of transverse matrix fatigue cracking behavior. Although not yet fully tested or verified, this recursion relation may eventually be useful for real-time applications such as monitoring damage in aircraft structures.

Williford, R.E.

1989-09-01

367

Pringle Parallel Computer.  

National Technical Information Service (NTIS)

The Pringle is a 64 processor MIMD computer with a 64 M (8 bit) instructions per second execution rate. The Pringle runs programs written for the Configurable, Highly Parallel (CHiP) Computer. That is, the Pringle executes the 64 separate instruction stre...

A. A. Kapauau J. T. Field D. B. Gannon L. Snyder

1984-01-01

368

Simple Fast Parallel Hashing  

Microsoft Academic Search

A hash table is a representation of a set in a linear size data structure that supports constant-time membership queries. We show how to construct a hash table for any given set of n keys in O(lg lg n) parallel time with high probability, using n processors on a weak version of a

Joseph Gil; Yossi Matias

1994-01-01

369

National Resource for Computation in Chemistry (NRCC). Attached scientific processors for chemical computations: a report to the chemistry community  

SciTech Connect

The demands of chemists for computational resources are well known and have been amply documented. The best and most cost-effective means of providing these resources is still open to discussion, however. This report surveys the field of attached scientific processors (array processors) and attempts to indicate their present and possible future use in computational chemistry. Array processors have the possibility of providing very cost-effective computation. This report attempts to provide information that will assist chemists who might be considering the use of an array processor for their computations. It describes the general ideas and concepts involved in using array processors, the commercial products that are available, and the experiences reported by those currently using them. In surveying the field of array processors, the author makes certain recommendations regarding their use in computational chemistry. 5 figures, 1 table (RWR)

Ostlund, N.S.

1980-01-01

370

Systolic Array Fault Tolerance Performance Analysis.  

National Technical Information Service (NTIS)

The reliability performance of six different systolic array fault tolerance techniques are determined and compared in terms of mean time between failure (MTBF). The six techniques include redundant arrays, companion processors, sequential row elimination ...

T. C. Choinski M. H. Leonhardt

1988-01-01

371

Streaming FFT Asynchronously on Graphics Processor Units  

Microsoft Academic Search

The Fast Fourier Transform (FFT), which is memory-access-intensive and follows a divide-and-conquer strategy, is one of the most important and heavily used kernels in scientific computing. The newest generation of Graphics Processor Units (GPUs) implement a stream architecture in addition to acting as powerful massively parallel coprocessors. Furthermore, the introduction of APIs for general-purpose computation on GPUs makes GPUs an attractive choice

Zhao Lili; Shengbing Zhang; Meng Zhang; Zhang Yi

2010-01-01

372

An image processor for SXGA/UXGA FPD  

Microsoft Academic Search

We present an image processor for SXGA (super extended graphics array, 1280×1024)/UXGA (ultra XGA, 1600×1200) FPD (flat panel display) such as TFT (thin film transistor) LCD (liquid crystal display) and PDP (plasma display panel). The proposed image processor can display the full screen of a FPD with lower or higher resolution of video sources such as NTSC, VGA, SVGA, XGA,

Chul-Ho Choi; Hwa-Hyun Cho; Jong-Seok Chae; Jin-Sung Park; Byong-Heon Kwon; Myung-Ryul Choi

1999-01-01

373

Optical processing for adaptive phased-array radar  

Microsoft Academic Search

Two architectural concepts for optical processors for adaptive phased-array radars (APAR) are discussed. A multichannel coherent correlator and a noncoherent optical vector-matrix processor are described, and their applications to APAR data processing are covered.

D. Casasent; D. Psaltis; B. V. K. Vijaya Kumar; M. Carlotto

1980-01-01

374

Issue Mechanism for Embedded Simultaneous Multithreading Processor  

NASA Astrophysics Data System (ADS)

Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to, or even surpassing, that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate-register-based issue control is adopted and consecutive instructions with a parallel flag of ‘0’ are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an Instruction Queue skip mechanism that skips a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)) and thread one or thread n has the highest priority in turn (PR(T1-Tn)). The results show that the RR(T1-Tn) policy outperforms the others and that PR(T1-Tn) is almost the same as RR(T1-Tn) in terms of issued instructions per cycle.

Zang, Chengjie; Imai, Shigeki; Frank, Steven; Kimura, Shinji

375

Parallel computing and domain decomposition  

SciTech Connect

Domain decomposition techniques appear a natural way to make good use of parallel computers. In particular, these techniques divide a computation into a local part, which may be done without any interprocessor communication, and a part that involves communication between neighboring and distant processors. This paper discusses some of the issues in designing and implementing a parallel domain decomposition algorithm. A framework for evaluating the cost of parallelism is introduced and applied to answering questions such as which and how many processors should solve global problems and what impact load balancing has on the choice of domain decomposition algorithm. The sources of performance bottlenecks are discussed. This analysis suggests that domain decomposition techniques will be effective on high-performance parallel processors and on networks of workstations. 17 refs., 8 figs.

Gropp, W.

1991-01-01

376

Petascale Virtual Machine: Computing on 100,000 Processors  

Microsoft Academic Search

In the 1990s the largest machines had a few thousand processors and PVM and MPI were key tools to making these machines usable. Now with the growing interest in Internet computing and the design of cellular architectures such as IBM’s Blue Gene computer, the scale of parallel computing has suddenly jumped to 100,000 processors or more. This talk will describe

Al Geist

2002-01-01

377

Real-time signal processor for pulsar studies  

Microsoft Academic Search

This paper describes the design, tests and preliminary results of a real-time parallel signal processor built to aid a wide variety of pulsar observations. The signal processor reduces the distortions caused by the effects of dispersion, Faraday rotation, doppler acceleration and parallactic angle variations, at a sustained data rate of 32 Msamples/sec. It also folds the pulses coherently over the

P. S. Ramkumar; A. A. Deshpande

2001-01-01

378

Soft-core processor study for node-based architectures.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

2008-09-01

379

Two-dimensional optoelectronic interconnect-processor and its operational bit error rate  

NASA Astrophysics Data System (ADS)

A two-dimensional (2-D) multi-channel 8x8 optical interconnect and processor system was designed and developed using complementary metal-oxide-semiconductor (CMOS) driven 850-nm vertical-cavity surface-emitting laser (VCSEL) arrays and photodetector (PD) arrays with corresponding wavelengths. We performed operation and bit-error-rate (BER) analysis on these free-space integrated 8x8 VCSEL optical interconnects driven by silicon-on-sapphire (SOS) circuits. A pseudo-random bit stream (PRBS) data sequence was used in operation of the interconnects. Eye diagrams were measured from individual channels and analyzed using a digital oscilloscope at data rates from 155 Mb/s to 1.5 Gb/s. Using a statistical model of Gaussian distribution for the random noise in the transmission, we developed a method to compute the BER instantaneously from the digital eye diagrams. Direct measurements on these interconnects were also taken on a standard BER tester for verification. We found that the results of the two methods were of the same order and within 50% accuracy. The integrated interconnects were investigated in an optoelectronic processing architecture of a digital halftoning image processor. Error diffusion networks implemented by the inherently parallel nature of photonics promise to provide high quality digital halftoned images.

Liu, J. Jiang; Gollsneider, Brian; Chang, Wayne H.; Carhart, Gary W.; Vorontsov, Mikhail A.; Simonis, George J.; Shoop, Barry L.

2004-10-01
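
The abstract above mentions estimating BER from eye diagrams under a Gaussian noise model but does not give the formula. A minimal sketch of the standard textbook Q-factor estimate follows; whether the authors used exactly this form is an assumption, and the function names and sample values are hypothetical.

```python
import math


def q_factor(mu1, sigma1, mu0, sigma0):
    """Q factor from the mean and standard deviation of the '1' and '0'
    rails of an eye diagram (hypothetical inputs, same units for all)."""
    return (mu1 - mu0) / (sigma1 + sigma0)


def ber_from_q(q):
    """Textbook Gaussian-noise estimate: BER = 0.5 * erfc(Q / sqrt(2)).
    Assumed here; the abstract does not state the authors' formula."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


if __name__ == "__main__":
    # Illustrative rail statistics only, not measured data.
    q = q_factor(mu1=1.0, sigma1=0.08, mu0=0.0, sigma0=0.08)
    print(f"Q = {q:.2f}, estimated BER = {ber_from_q(q):.3e}")
```

Because the estimate needs only four statistics per channel, it can be updated as each eye diagram is captured, which is consistent with the "instantaneous" computation the abstract describes.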

380

Fault tolerance techniques for systolic arrays  

SciTech Connect

Digital systems that are operated in applications where there is a high cost of failure require high reliability and continuous operation. Since it is impossible to guarantee that portions of a system will never fail, such systems need to be designed to tolerate failures of the system components. The discipline of fault-tolerant computing is, therefore, one which has attracted a great deal of research interest. Researchers have attempted to derive highly effective and, at the same time, efficient techniques to tolerate failures in complex digital systems. The high computation needs of many applications can now be met through the use of highly parallel special-purpose systems that can be produced very cost effectively through the use of very large scale integration (VLSI) technology. Systolic arrays, such as the ESL systolic array and the Carnegie Mellon Warp processor, are examples of such systems.

Abraham, J.A.; Banerjee, P.; Chen, C.Y.; Fuchs, W.K.; Kua, S.Y.; Reddy, A.L.N. (Univ. of Illinois)

1987-07-01

381

Compact optical processor for Hough and frequency domain features  

NASA Astrophysics Data System (ADS)

Shape recognition is necessary in a broad range of applications such as traffic sign or work piece recognition. It requires not only neighborhood processing of the input image pixels but global interconnection of them. The Hough transform (HT) performs such a global operation and it is well suited to the preprocessing stage of a shape recognition system. Translation invariant features can be easily calculated from the Hough domain. We have implemented on the computer a neural network shape recognition system which contains a HT, a feature extraction, and a classification layer. The advantage of this approach is that the total system can be optimized with well-known learning techniques and that it can exploit the parallelism of the algorithms. However, the HT is a time consuming operation. Parallel, optical processing is therefore advantageous. Several systems have been proposed, based on space multiplexing with arrays of holograms and CGHs or time multiplexing with acousto-optic processors or by image rotation with incoherent and coherent astigmatic optical processors. We took up the last mentioned approach because 2D array detectors are read out line by line, so a 2D detector can achieve the same speed and is easier to implement. Coherent processing can allow the implementation of filters in the frequency domain. Features based on wedge/ring, Gabor, or wavelet filters have been proven to show good discrimination capabilities for texture and shape recognition. The astigmatic lens system which is derived from the mathematical formulation of the HT is long and contains a non-standard, astigmatic element. By methods of lens transformations for coherent applications we map the original design to a shorter lens with a smaller number of well separated standard elements and with the same coherent system response. The final lens design still contains the frequency plane for filtering and ray-tracing shows diffraction limited performance.
Image rotation can be done optically by a rotating prism. We realize it on a fast FLC-SLM of our lab as input device. The filters can be implemented on the same type of SLM with 128 by 128 square pixels of size, resulting in a total length of the lens of less than 50cm.

Ott, Peter

1996-11-01

382

Systolic-array architecture for 2D IIR Wideband dual-beam space-time plane-wave filters  

Microsoft Academic Search

A spatio-temporal 2D IIR broadband plane-wave filter having 2 user-selectable passbands is proposed using the concept of 2D network resonance. The plane-wave filter is capable of the highly-selective directional enhancement of 2 far-field plane-waves in the presence of undesired waves at different directions of arrival. A massively-parallel systolic-array processor architecture is proposed for the real-time VLSI implementation of the filter.

Chamith Wijenayake; Arjuna Madanayake; Len T. Bruton

2010-01-01

383

Parallel quicksort algorithm. I. Run time analysis  

SciTech Connect

A general-purpose sorting algorithm suitable for execution on a parallel computer is presented. The algorithm, which is based on quicksort, does not require a fixed number of processors but may theoretically use as many processors as are available. The analysis of the algorithm reveals that there is a maximum number of processors which can be used for a particular size of set s_n. 6 references.

Evans, D.J.; Dunbar, R.C.

1982-01-01

384

An efficient massively parallel Euler solver for unstructured grids  

NASA Astrophysics Data System (ADS)

A data parallel mesh-vertex upwind finite-volume scheme for solving the Euler equations on triangular unstructured meshes is described. A novel vertex-based partitioning of the problem is introduced which minimizes the computation and communication costs associated with distributing the computation to the processors of a massively parallel computer. Finally, the performance of this unstructured computation on 8K processors of the Connection Machine CM-2 is compared with one processor of a Cray-YMP. The experiments show that 8K processors of the CM-2 achieve approximately 70 percent of the performance of one processor of the Cray-YMP on the unstructured mesh computations described here.

Hammond, Steven W.; Barth, Timothy J.

1991-01-01

385

Load balancing for parallel forwarding  

Microsoft Academic Search

Workload distribution is critical to the performance of network processor based parallel forwarding systems. Scheduling schemes that operate at the packet level, e.g., round-robin, cannot preserve packet-ordering within individual TCP connections. Moreover, these schemes create duplicate information in processor caches and therefore are inefficient in resource utilization. Hashing operates at the flow level and is naturally able to maintain per-connection

Weiguang Shi; M. H. MacGregor; Pawel Gburzynski

2005-01-01
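
The flow-level hashing idea in the abstract above can be sketched as follows. This is a minimal illustration and not the authors' scheme: the choice of CRC32 over the 5-tuple, the function name `pick_processor`, and the sample addresses are all assumptions.

```python
import zlib


def pick_processor(src_ip, dst_ip, src_port, dst_port, proto, n_processors):
    """Map a flow's 5-tuple to a processor index.

    Every packet of the same flow hashes to the same index, so packet
    order within a connection is preserved and that flow's state stays
    in a single processor's cache."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_processors


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 40000, 80, "tcp")  # hypothetical flow
    print(pick_processor(*flow, n_processors=8))
```

A plain static hash like this keeps flows intact but does nothing about uneven flow sizes, so a few heavy flows can still overload one processor.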

386

NWChem: scalable parallel computational chemistry  

SciTech Connect

NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code specifically designed for large scale parallel applications. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array Toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain.
Within the context created by the features above NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereupon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.

van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

2011-11-01

387

Cost-Effective Parallel Computing  

Microsoft Academic Search

Many academic papers imply that parallel computing is only worthwhile when applications achieve nearly linear speedup (i.e., execute nearly p times faster on p processors). This note shows that parallel computing is cost-effective whenever speedup exceeds costup---the parallel system cost divided by uniprocessor cost. Furthermore, when applications have large memory requirements (e.g., 512 megabytes), the costup---and hence speedup necessary to

David A. Wood; Mark D. Hill

1995-01-01
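
The criterion stated in the abstract above (parallel computing is cost-effective whenever speedup exceeds costup) can be checked with a few lines of arithmetic. The sketch below is illustrative; the function name and the sample times and prices are hypothetical.

```python
def is_cost_effective(t_uni, t_par, cost_uni, cost_par):
    """Return True when speedup exceeds costup.

    speedup = uniprocessor time / parallel time
    costup  = parallel system cost / uniprocessor cost"""
    speedup = t_uni / t_par
    costup = cost_par / cost_uni
    return speedup > costup


if __name__ == "__main__":
    # Hypothetical numbers: a machine that costs 3x as much and runs
    # 4x faster is cost-effective even though its speedup is far from
    # linear in the processor count.
    print(is_cost_effective(t_uni=100.0, t_par=25.0,
                            cost_uni=10_000, cost_par=30_000))
```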

388

Fusion processor simulation (FPSim)  

Microsoft Academic Search

The Fusion Processor Simulation (FPSim) is being developed by Rome Laboratory to support the Discrimination Interceptor Technology (DITP) and Advanced Sensor Technology (ASTP) Programs of the Ballistic Missile Defense Organization. The purpose of the FPSim is to serve as a test bed and evaluation tool for establishing the feasibility of achieving threat engagement timelines. The FPSim supports the integration, evaluation,

Mark D. Barnell; Douglas G. Wynne; Brian J. Rahn

1998-01-01

389

Rotary pipeline processors  

Microsoft Academic Search

The rotary pipeline processor is a new architecture for superscalar computing. It is based on a simple and regular pipeline structure which can support several ALUs for efficient dispatching of multiple instructions. Register values flow around a rotary pipeline, constrained by local data dependencies. During normal operation the control circuits are not on the critical path and

Simon Moore; Peter Robinson; Steve Wilcox

1996-01-01

390

AFOS Word Processor.  

National Technical Information Service (NTIS)

TYPEWRITER.FR is a program which adds a new dimension to AFOS--that of a word processor. By observing a few simple rules of syntax, written correspondence not requiring a letterhead can be composed and edited at an ADM in an AFOS product set aside for thi...

M. S. Webb

1981-01-01

391

Parallel symmetry-breaking in sparse graphs  

Microsoft Academic Search

We describe efficient deterministic techniques for breaking symmetry in parallel. The techniques work well on rooted trees and graphs of constant degree or genus. Our primary technique allows us to 3-color a rooted tree in O(lg* n) time on an EREW PRAM using a linear number of processors. We apply these techniques to construct fast linear processor algorithms for several problems,

Andrew V. Goldberg; Serge A. Plotkin; Gregory E. Shannon

1987-01-01

392

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore  

SciTech Connect

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

Liao, C; Quinlan, D J; Willcock, J J; Panas, T

2008-12-12

393

Multimedia extensions for DLX processor  

Microsoft Academic Search

In recent years, the success of Internet and World Wide Web, and the growing feasibility of image and video compression techniques have pushed multimedia into mainstream computing. These requirements necessitate new and modified hardware architectures enabling realtime multimedia applications. Three methods have been proposed for enhancing multimedia architectures namely dedicated processors, media processors and multimedia extensions for general-purpose processors. Multimedia

Elham Khorsandi Nia; Omid Fatemi

2003-01-01

394

High Performance Data Processor (HPDP)  

Microsoft Academic Search

With the increasing use of remote sensing and earth observation technologies, the large amount of data collected onboard requires high performance and fast processing hardware. Also the flexibility and processing requirements for regenerative processor payloads are of a magnitude larger than those which could be successfully handled by classical processors. Currently available processors for space cannot adapt to changing communication

M. A. Syed; E. Schueler

2008-01-01

395

Precise Exceptions in Asynchronous Processors  

Microsoft Academic Search

The presence of precise exceptions in a processor leads to complications in its design. Some recent processor architectures have sacrificed this requirement for performance reasons at the cost of software complexity. We present an implementation strategy for precise exceptions in asynchronous processors that does not block the instruction fetch when exceptions do not occur; the

Rajit Manohar; Mika Nyström; Alain J. Martin

2001-01-01

396

Scheduling for Speed Bounded Processors  

Microsoft Academic Search

We consider online scheduling algorithms in the dynamic speed scaling model, where a processor can scale its speed between 0 and some maximum speed T. The processor uses energy at rate s^α when run at speed s, where α > 1 is a constant. Most modern processors use dynamic speed scaling to manage their energy usage. This leads to

Nikhil Bansal; Ho-leung Chan; Tak-Wah Lam; Lap-kei Lee

2008-01-01
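
The power model in the abstract above (energy rate equal to speed raised to a constant power greater than 1, with speed capped at a maximum T) implies a simple trade-off between time and energy. The sketch below works through that arithmetic; the function name and the numbers are hypothetical, and it is not the scheduling algorithm of the paper.

```python
def run_job(work, speed, alpha, max_speed):
    """Time and energy to finish `work` units at a constant `speed`.

    Power is speed**alpha, so:
      time   = work / speed
      energy = power * time = work * speed**(alpha - 1)"""
    if not 0 < speed <= max_speed:
        raise ValueError("speed must lie in (0, max_speed]")
    if alpha <= 1:
        raise ValueError("alpha must exceed 1 in this model")
    time = work / speed
    energy = (speed ** alpha) * time
    return time, energy


if __name__ == "__main__":
    # Doubling the speed halves the time but, with alpha = 3,
    # quadruples the energy for the same amount of work.
    print(run_job(work=10, speed=1, alpha=3, max_speed=4))
    print(run_job(work=10, speed=2, alpha=3, max_speed=4))
```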

397

Array radars - An update. II  

NASA Astrophysics Data System (ADS)

Research aimed at improving array radars is reviewed. Advances in MMICs, the use of HEMT low noise amplifiers for analog and digital circuitry, the application of VHSIC chips to the programmable signal processor of the F-16 airborne fire control radar, Si compiler language, memory chips, and GHz and GaAs logic are discussed. Consideration is given to CMOS gate arrays, floating point chips, a single-chip digital signal processor, systolic array architectures, radiation hardened chips, digital beamforming, distributed beamsteering computers, fiber optics, flat low voltage displays, and adaptive-adaptive array processing.

Brookner, Eli

1987-03-01

398

Simulation and Test of an Optical Matrix-Vector Processor.  

NASA Astrophysics Data System (ADS)

This dissertation describes research in the computer simulation and the experimental laboratory evaluation of optical matrix-vector (linear algebra) processors. A single optical linear algebraic processing architecture is used for both the simulations and the laboratory implementation. The case study solved by the processor is a linear dynamic structural analysis finite element problem. The response of a plane frame structure under earthquake loading is investigated. The laboratory optical processor utilizes a new AC-coupled modulation technique which eliminates thermal problems discovered in previous laboratory work. The processor uses laser diodes, a multi-channel acousto-optic Bragg cell, and a multi-channel linear detector array and wide-band detector amplifiers. Simplified optical processor error source models are developed to simulate the optical processor. The error model simplifications ease the computational requirements and reduce the complexity of the simulations. The error source levels are determined for the laboratory optical processor, which is used to verify the validity of the error source models. The case study is run on the laboratory optical processor and its operation is evaluated. We find that the AC-coupled modulation technique is extremely useful for eliminating detector thermal effects. The case study is solved successfully on the optical system. Laboratory optical processor experiments and measurements verify that the error source model simulator accurately predicts the performance of the laboratory processor. Laboratory and simulation results are analyzed, and various critical processor fabrication issues are then detailed. Extensions of the laboratory system to larger size are discussed, and comments on potential improvements with the latest technology are advanced.

Taylor, Bradley Keith

1988-12-01

399

Micromechanical resonator array for an implantable bionic ear.  

PubMed

In this paper we report on a multi-resonant transducer that may be used to replace a traditional speech processor in cochlear implant applications. The transducer, made from an array of micro-machined polymer resonators, is capable of passively splitting sound into its frequency sub-bands without the need for analog-to-digital conversion and subsequent digital processing. Since all bands are mechanically filtered in parallel, there is low latency in the output signals. The simplicity of the device, high channel capability, low power requirements, and small form factor (less than 1 cm) make it a good candidate for a completely implantable bionic ear device. PMID:16439832

Bachman, Mark; Zeng, Fan-Gang; Xu, Tao; Li, G-P

2006-01-17

400

Massively Parallel Two-Dimensional TLM Algorithm on Graphics Processing Units  

Microsoft Academic Search

Recent advances in computing technology have brought massively parallel computing power to desktop PCs. As multi-core processor technology becomes mature, a new front in parallel technology based on graphics processors has emerged. A massively parallel 2D-TLM algorithm for NVIDIA advanced graphics processors has been developed. The proposed parallel computing paradigm can be adopted straightforwardly to accelerate time-domain electromagnetic field modeling

Filippo V. Rossi; Nikolaus Fichtner; Peter Russer

2009-01-01

401

Massively parallel two-dimensional TLM algorithm on graphics processing units  

Microsoft Academic Search

Recent advances in computing technology have brought massively parallel computing power to desktop PCs. As multi-core processor technology becomes mature, a new front in parallel technology based on graphics processors has emerged. A massively parallel 2D-TLM algorithm for NVIDIA advanced graphics processors has been developed. The proposed parallel computing paradigm can be adopted straightforwardly to accelerate time-domain electromagnetic field modeling

Filippo V. Rossi; P. P. M. So; N. Fichtner; P. Russer

2008-01-01

402

High Performance Parallel Computing.  

National Technical Information Service (NTIS)

The accomplishments of the research project 'High Performance Parallel Computing' for the year 1983 span algorithm formulation, parallel programming languages, basic software for the Texas Reconfigurable Array Computer and validation of design concepts for...

J. C. Browne; G. J. Lipovski; M. Malek

1985-01-01

403

Parallelization of Hydrocodes on the Intel Hypercube.  

National Technical Information Service (NTIS)

This report describes the Intel hydrocode parallelization project. The project consists of running four solution methods on two test problems. The code used is a modified version of one written for a project on the Heterogeneous Element Processor (HEP). I...

D. L. Hicks; L. M. Liebrock; V. A. Mousseau; G. A. Mortensen

1987-01-01

404

Problem Size, Parallel Architecture and Optimal Speedup.  

National Technical Information Service (NTIS)

The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optima...

D. M. Nicol; F. H. Willard

1987-01-01

405

Massively parallel mathematical sieves  

SciTech Connect

The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.

Montry, G.R.

1989-01-01
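
To make the idea of distributing a sieve over processing elements concrete, the sketch below splits the range into contiguous blocks, one per worker, and sieves each block independently with a shared list of small primes. This is a simple block decomposition run sequentially for illustration; it is not the scattered decomposition studied in the report, and all function names are hypothetical.

```python
import math


def base_primes(limit):
    """Plain Sieve of Eratosthenes: all primes <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            for m in range(p * p, limit + 1, p):
                is_prime[m] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def sieve_block(lo, hi, primes):
    """Primes in [lo, hi), given every prime p with p * p < hi."""
    is_prime = [True] * (hi - lo)
    for p in primes:
        if p * p >= hi:
            break
        # First multiple of p inside the block, never below p * p.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            is_prime[m - lo] = False
    return [lo + i for i, flag in enumerate(is_prime) if flag and lo + i >= 2]


def block_sieve(n, workers):
    """All primes <= n (n >= 0), one contiguous block per worker.

    The blocks share no state, so each sieve_block call could run on
    its own processing element; here they simply run in a loop."""
    primes = base_primes(math.isqrt(n))
    size = -(-(n + 1) // workers)  # ceiling division
    found = []
    for w in range(workers):
        lo, hi = w * size, min((w + 1) * size, n + 1)
        if lo < hi:
            found.extend(sieve_block(lo, hi, primes))
    return found


if __name__ == "__main__":
    print(block_sieve(30, workers=4))
```

The only data every worker needs in common is the list of primes up to the square root of the range limit, which is why this kind of sieve partitions cleanly.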

406

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors  

Microsoft Academic Search

Modern processors are evolving into hybrid, heterogeneous processors with both CPU and GPU cores used for general purpose computation. Several languages such as Brook, CUDA, and more recently OpenCL are being developed to fully harness the potential of these processors. These languages typically involve the control code running on the CPU and the performance-critical, data-parallel kernel code running on the

Jayanth Gummaraju; Laurent Morichetti; Michael Houston; Ben Sander; Benedict R. Gaster; Bixia Zheng

2010-01-01

407

BitSNAP: Dynamic Significance Compression for a Low-Energy Sensor Network Asynchronous Processor  

Microsoft Academic Search

We present a novel asynchronous processor architecture called BitSNAP that utilizes bit-serial datapaths with dynamic significance compression to yield extremely low-energy consumption. Based on the Sensor Network Asynchronous Processor (SNAP) ISA, BitSNAP can reduce datapath energy consumption by 50% over a comparable parallel-word processor, while still providing performance suited for powering low-energy sensor network

Virantha N. Ekanayake; Clinton Kelly IV; Rajit Manohar

2005-01-01

408

A generalization of Amdahl's law and relative conditions of parallelism  

Microsoft Academic Search

In this work I present a generalization of Amdahl's law on the limits of a parallel implementation with many processors. In particular I establish some mathematical relations involving the number of processors and the dimension of the treated problem, and with these conditions I define, on the ground of the reachable speedup, some classes of parallelism for the implementations. I

Gianluca Argentini

2002-01-01
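
For reference, the classical form of Amdahl's law that the abstract above generalizes can be written in a few lines. This sketch shows only the classical law; the generalization proposed in the paper is not reproduced here, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Classical Amdahl's law: 1 / ((1 - f) + f / p), where f is the
    fraction of the work that can be parallelized and p the number of
    processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


if __name__ == "__main__":
    # With 90% of the work parallelizable the speedup can never
    # exceed 1 / (1 - 0.9) = 10, however many processors are added.
    for p in (1, 10, 100, 1000):
        print(p, round(amdahl_speedup(0.9, p), 3))
```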

409

Parallel logic simulation on general purpose machines  

Microsoft Academic Search

Three parallel algorithms for logic simulation have been developed and implemented on a general purpose shared-memory parallel machine. The first algorithm is a synchronous version of a traditional event-driven algorithm which achieves speed-ups of 6 to 9 with 15 processors. The second algorithm is a synchronous unit-delay compiled mode algorithm which achieves speed-ups of 10 to 13 with 15 processors.

Larry Soulé; Tom Blank

1988-01-01

410

Enhancing Scalability of Parallel Structured AMR Calculations  

SciTech Connect

This paper discusses parallel scaling performance of large scale parallel structured adaptive mesh refinement (SAMR) calculations in SAMRAI. Previous work revealed that poor scaling qualities in the adaptive gridding operations in SAMR calculations cause them to become dominant for cases run on up to 512 processors. This work describes algorithms we have developed to enhance the efficiency of the adaptive gridding operations. Performance of the algorithms is evaluated for two adaptive benchmarks run on up to 512 processors of an IBM SP system.

Wissink, A M; Hysom, D; Hornung, R D

2003-02-10

411

CNET's Ultimate Processor Guide  

NSDL National Science Digital Library

This new special report from CNET is offered as a guide to "everything you need to know about processors." Users shopping for a new PC will want to start with the Know Your CPU section, which offers a set of flash cards on the most popular processors. Those interested in the higher-end chips will want to read the Athlon vs. Pentium III Shoot-Out section, which compares features and performance. The more curious and technical-minded will also be interested in the Supersonic Chips section, which explains what makes chips fast and speculates on future speeds. Finally, the DIY user may want to review Upgrading Your CPU and The Trouble With Overclocking, the latter of which explores the pros and cons of overclocking CPUs.

Blachere, Kristina.

412

Photodetector arrays for optical processing  

Microsoft Academic Search

The present discussion of design and performance issues emerging from the dynamic range requirements placed by optical processors on photodetector arrays emphasizes time-integrating photodetector arrays. The two most common photodetector element types used in time-integrating arrays are the N+/P and photogate detector types. Attention is given to detector array developments for optical processing which have yielded 1-2 orders-of-magnitude improvements, relative

Paul P. Suni

1989-01-01

413

Tiled Multicore Processors  

Microsoft Academic Search

For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a

Michael B. Taylor; Walter Lee; Jason E. Miller; David Wentzlaff; Ian Bratt; Ben Greenwald; Henry Hoffmann; Paul R. Johnson; Jason S. Kim; James Psota; Arvind Saraf; Nathan Shnidman; Volker Strumpen; Matthew I. Frank; Saman Amarasinghe; Anant Agarwal

2009-01-01

414

Iterative color-multiplexed, electro-optical processor.  

PubMed

A noncoherent optical vector-matrix multiplier using a linear LED source array and a linear P-I-N photodiode detector array has been combined with a 1-D adder in a feedback loop. The resultant iterative optical processor and its use in solving simultaneous linear equations are described. Operation on complex data is provided by a novel color-multiplexing system. PMID:19687900

Psaltis, D; Casasent, D; Carlotto, M

1979-11-01

415

Interprocedural Analysis for Parallelization  

Microsoft Academic Search

This paper presents an extensive empirical evaluation of an interprocedural parallelizing compiler, developed as part of the Stanford SUIF compiler system. The system incorporates a comprehensive and integrated collection of analyses, including privatization and reduction recognition for both array and scalar variables, and symbolic analysis of array subscripts. The interprocedural analysis framework is designed to provide analysis results nearly as

Mary W. Hall; Brian R. Murphy; Saman P. Amarasinghe; Shih-wei Liao; Monica S. Lam

1995-01-01

416

The Processor Working Set and Its Use in Scheduling Multiprocessor Systems  

Microsoft Academic Search

The concept of a processor working set (PWS) as a single value parameter for characterizing the parallel program behavior is introduced. Through detailed experimental studies of different algorithms on a transputer-based multiprocessor machine, it is shown that the PWS is a robust measure for characterizing the workload of a multiprocessor system. It is shown that processor allocation strategies based on

Dipak Ghosal; Giuseppe Serazzi; Satish K. Tripathi

1991-01-01

417

Run-time versus compile-time instruction scheduling in superscalar (RISC) processors: performance and tradeoffs  

Microsoft Academic Search

The RISC revolution has spurred the development of processors with increasing degrees of instruction level parallelism (ILP). In order to realize the full potential of these processors, multiple instructions must continuously be issued and executed in a single cycle. Consequently, instruction scheduling plays a crucial role as an optimization in this context. While early attempts at instruction scheduling were limited

Allen Leung; Krishna V. Palem; Cristian Ungureanu

1996-01-01

418

Promise in Impermanence: Children Writing with Unlimited Access to Word Processors.  

ERIC Educational Resources Information Center

A 2-year study examined writing skills development of 11- and 12-year-olds with unlimited access to word processors. Samples of the 22 subjects' narrative writing were compared with samples from a parallel class that used handwriting methods. Results indicated that the children using word processors produced better quality writing than the…

Breese, Chris

1996-01-01

419

Parallel Genetic Algorithm for Alpha Spectra Fitting  

NASA Astrophysics Data System (ADS)

We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. The GA is a resource-intensive method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors to use in a simulation.

García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio

2005-01-01

420

Scalable Parallel Crash Simulations  

SciTech Connect

We are pleased to submit our efforts in parallelizing the PRONTO application suite for consideration in the SuParCup 99 competition. PRONTO is a finite element transient dynamics simulator which includes a smoothed particle hydrodynamics (SPH) capability; it is similar in scope to the well-known DYNA, PamCrash, and ABAQUS codes. Our efforts over the last few years have produced a fully parallel version of the entire PRONTO code which (1) runs fast and scalably on thousands of processors, (2) has performed the largest finite-element transient dynamics simulations we are aware of, and (3) includes several new parallel algorithmic ideas that have solved some difficult problems associated with contact detection and SPH scalability. We motivate this work, describe the novel algorithmic advances, give performance numbers for PRONTO running on Sandia's Intel Teraflop machine, and highlight two prototypical large-scale computations we have performed with the parallel code. We have successfully parallelized a large-scale production transient dynamics code with a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations. To be able to simulate a more than ten million element model in a few tenths of a second per timestep is unprecedented for solid dynamics simulations, especially when full global contact searches are required. The key reason is our new algorithmic ideas for efficiently parallelizing the contact detection stage. To our knowledge scalability of this computation had never before been demonstrated on more than 64 processors. This has enabled parallel PRONTO to become the only solid dynamics code we are aware of that can run effectively on 1000s of processors. More importantly, our parallel performance compares very favorably to the original serial PRONTO code which is optimized for vector supercomputers. On the container crush problem, a Teraflop node is as fast as a single processor of the Cray Jedi. This means that on the Teraflop machine we can now run simulations with tens of millions of elements thousands of times faster than we could on the Jedi! This is enabling transient dynamics simulations of unprecedented scale and fidelity. Not only can previous applications be run with vastly improved resolution and speed, but qualitatively new and different analyses have been made possible.

Attaway, Stephen; Barragy, Ted; Brown, Kevin; Gardner, David; Gruda, Jeff; Heinstein, Martin; Hendrickson, Bruce; Metzinger, Kurt; Neilsen, Mike; Plimpton, Steve; Pott, John; Swegle, Jeff; Vaughan, Courtenay

1999-06-01

421

Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems  

Microsoft Academic Search

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none address the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel

James S. Plank; Michael G. Thomason

2001-01-01

422

Guarded execution and branch prediction in dynamic ILP processors  

Microsoft Academic Search

We evaluate the effects of guarded (or conditional, or predicated) execution on the performance of an instruction level parallel processor employing dynamic branch prediction. First, we assess the utility of guarded execution, both qualitatively and quantitatively, using a variety of application programs. Our assessment shows that guarded execution significantly increases the opportunities, for both compiler and dynamic hardware, to extract

Dionisios N. Pnevmatikatos; Gurindar S. Sohi

1994-01-01

423

Parallel-computing structures for adaptive maximum-likelihood receivers  

SciTech Connect

Bandwidth-efficient digital data transmission over telephone and radio channels is significantly improved by the use of adaptive equalization. Among the numerous adaptive equalizer and receiver structures developed during the last two decades, adaptive maximum-likelihood receivers have emerged as front runners with respect to error-rate performance. However, the high degree of computational complexity of the optimum maximum-likelihood receivers has prohibited their use in many applications. This dissertation presents a study of parallel-computing structures that provide high computation throughput for implementation of adaptive maximum-likelihood receivers. Based on systolic array concepts, a two-dimensional array implementation of the Viterbi processor for adaptive maximum-likelihood receivers is presented. The array computes state transition metrics and survivor metric table addresses in a highly concurrent fashion. All interprocessor data flow and interconnections within the array are nearest-neighbor. A number of variations in the array design are described that enhance its versatility. A high-bandwidth memory interface for the survivor metric table memory is proposed.

Provence, J.D.

1987-01-01

424

Waste from food processors  

SciTech Connect

Food processing companies, by nature of the commodities they deal in and the products they provide, generate a much higher percentage of biodegradable, organic wastes than they do nonorganic wastes. The high percentage of food materials, and to a lesser extent, paper, found in a food processor's waste stream makes composting a highly cost-effective way to manage the wastes. This is the last in a series of articles that discussed solid waste management in various public arenas. Each segment highlighted particulars -- the waste stream; how the waste is handled; waste reduction and recovery programs; and the direction of future waste management -- that are specific to that area.

Sheehan, K.

1993-12-01

425

Electrostatically focused addressable field emission array chips (AFEA's) for high-speed massively parallel maskless digital E-beam direct write lithography and scanning electron microscopy  

Microsoft Academic Search

Systems and methods are described for addressable field emission array (AFEA) chips. A method of operating an addressable field-emission array includes: generating a plurality of electron beams from a plurality of emitters that compose the addressable field-emission array; and focusing at least one of the plurality of electron beams with an on-chip electrostatic focusing stack. The systems and methods provide

Clarence E. Thomas; Larry R. Baylor; Edgar Voelkl; Michael L. Simpson; Michael J. Paulus; Douglas H. Lowndes; John H. Whealton; John C. Whitson; John B. Wilgen

2002-01-01

426

Parallel performance of the fine-grain pipeline FPGA image processing system  

NASA Astrophysics Data System (ADS)

The use of FPGA circuits in imaging systems is increasing. They compete with other computing environments. The article describes the guidelines to be followed when choosing the type of image processing computing system, taking into consideration the advantages and disadvantages of each technology: general purpose processor, digital signal processor, graphical processing unit, application-specific integrated circuit and field programmable gate array. Attention is drawn to various video transmission standards. The state of research and development trends in the field of FPGA-based image processing are briefly presented. A method for defining the processing performance of image processing is proposed. It is proven that for a pipeline architecture implemented in FPGA, a linear speedup is achieved and parallel efficiency is equal to one.

Gorgoń, M.

2012-06-01

427

A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC (Superconducting Super Collider) detectors  

SciTech Connect

A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of GigaBytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab.

Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C. (Fermi National Accelerator Lab., Batavia, IL (USA)); Lockyer, N.; VanBerg, R. (Pennsylvania Univ., Philadelphia, PA (USA))

1989-12-01

428

A Preliminary Investigation into Parallel Routing on a Hypercube Computer  

Microsoft Academic Search

This paper describes an experiment in which parallel routing is performed on a medium grained hypercube parallel processor having 64 processing elements. Each node is a complete 32-bit computer with 128 K-bytes of memory and is connected to the other nodes via a direct hypercube interconnection network. A new parallel routing algorithm was developed to exploit this parallel structure. It

O. A. Olukotun; T. N. Mudge

1987-01-01

429

A preliminary investigation into parallel routing on a hypercube computer  

Microsoft Academic Search

This paper describes an experiment in which parallel routing is performed on a medium grained hypercube parallel processor having 64 processing elements. Each node is a complete 32-bit computer with 128 K-bytes of memory and is connected to the other nodes via a direct hypercube interconnection network. A new parallel routing algorithm was developed to exploit this parallel structure. It

O. A. Olukotun; Trevor N. Mudge

1987-01-01

430

Scheduling Parallel Applications Using Malleable Tasks on Clusters  

Microsoft Academic Search

Scheduling is a central issue for implementing applications on parallel and distributed systems. This problem has been intensively studied for conventional parallel systems. Clusters of SMPs (symmetric multiprocessors), which are increasingly popular, are a cost-effective alternative to parallel supercomputers. New characteristics are influencing the execution of parallel applications, such as the hierarchical structure and the heterogeneity

Denis Trystram

2001-01-01

431

Fusion processor simulation (FPSim)  

NASA Astrophysics Data System (ADS)

The Fusion Processor Simulation (FPSim) is being developed by Rome Laboratory to support the Discrimination Interceptor Technology (DITP) and Advanced Sensor Technology (ASTP) Programs of the Ballistic Missile Defense Organization. The purpose of the FPSim is to serve as a test bed and evaluation tool for establishing the feasibility of achieving threat engagement timelines. The FPSim supports the integration, evaluation, and demonstration of different strategies, system concepts, and Acquisition Tracking & Pointing (ATP) subsystems and components. The environment comprises a simulation capability within which users can integrate and test their application software models, algorithms and databases. The FPSim must evolve as algorithm developments mature to support independent evaluation of contractor designs and the integration of a number of fusion processor subsystem technologies. To accomplish this, the simulation contains validated modules, databases, and simulations. It possesses standardized engagement scenarios, architectures and subsystem interfaces, and provides a hardware and software framework which is flexible to support growth, reconfiguration, and simulation component modification and insertion. Key user interaction features include: (1) Visualization of platform status through displays of the surveillance scene as seen by imaging sensors. (2) User-selectable data analysis and graphics display during the simulation execution as well as during post-simulation analysis. (3) Automated, graphical tools to permit the user to reconfigure the FPSim, i.e., 'Plug and Play' various model/software modules. The FPSim is capable of hosting and executing user's software algorithms of image processing, signal processing, subsystems, and functions for evaluation purposes.

Barnell, Mark D.; Wynne, Douglas G.; Rahn, Brian J.

1998-07-01

432

Parallelization of FM-Index  

Microsoft Academic Search

A parallel design and implementation of the FM-index is presented in this paper. The FM-index is a self-contained, highly compressed indexing algorithm whose performance is crucial in applications. With the popularity of multi-core processors, parallel computing allows the FM-index to run faster by performing multiple computations simultaneously when possible. Our approach works by splitting input data into overlapping blocks

Di Zhang; Yunquan Zhang; Shengfei Liu; Xiaodi Huang

2008-01-01
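
The abstract above is cut off after "overlapping blocks", so the paper's actual scheme is not recoverable here. As a rough sketch of why overlap matters when a text is split for independent indexing, the following assumes that consecutive blocks share `len(pattern) - 1` characters, so every occurrence falls wholly inside at least one block. Plain substring search stands in for a per-block FM-index, and all names are hypothetical.

```python
def split_overlapping(text: str, block_size: int, overlap: int):
    """Return (start_offset, block) pairs; consecutive blocks share `overlap` characters."""
    if overlap < 0 or block_size <= overlap:
        raise ValueError("need 0 <= overlap < block_size")
    step = block_size - overlap
    blocks = []
    start = 0
    while True:
        blocks.append((start, text[start:start + block_size]))
        if start + block_size >= len(text):
            break
        start += step
    return blocks


def find_all(text: str, pattern: str, block_size: int):
    """Find every occurrence of `pattern` by searching each block independently.

    Each block could be handled by a separate core; str.find is only a
    stand-in for querying a per-block index.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    hits = set()  # the overlap can report the same position twice
    for offset, block in split_overlapping(text, block_size, len(pattern) - 1):
        pos = block.find(pattern)
        while pos != -1:
            hits.add(offset + pos)
            pos = block.find(pattern, pos + 1)
    return sorted(hits)
```

A larger overlap costs extra index space per block, which is one reason to bound the supported pattern length in a design like this.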

433

A parallel string search algorithm  

Microsoft Academic Search

A new parallel processing algorithm for solving string search problems is presented. The proposed algorithm uses O(m×n) processors where n is the length of a text and m is the length of a pattern. It requires two and only two iteration steps to find the pattern in the text, while the best existing parallel algorithm needs the computation time O(loglog

Yoshiyasu Takefuji; Toshimitsu Tanaka; Kuo Chun Lee

1992-01-01
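
To make the O(m×n) processor count in the record above concrete, the sketch below simulates one comparison per (text position, pattern position) pair followed by a per-position AND reduction. It is a generic illustration that runs sequentially and does not reproduce the algorithm of the paper; the function names are assumptions.

```python
def match_grid(text: str, pattern: str):
    """Step 1: one independent comparison per (i, j) pair, m*n in total.

    Cell (i, j) is True when text[i + j] exists and equals pattern[j].
    On an m-by-n processor array all cells could be evaluated at once.
    """
    n, m = len(text), len(pattern)
    return [
        [i + j < n and text[i + j] == pattern[j] for j in range(m)]
        for i in range(n)
    ]


def grid_search(text: str, pattern: str):
    """Step 2: AND-reduce each row; a row that is all True marks a match start."""
    grid = match_grid(text, pattern)
    return [i for i, row in enumerate(grid) if row and all(row)]
```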

434

Adaptive Explicitly Parallel Instruction Computing  

Microsoft Academic Search

Poor scalability of Superscalar architectures with increasing instruction-level parallelism (ILP) has resulted in a trend towards statically scheduled horizontal architectures such as Very Long Instruction Word (VLIW) processors and their more sophisticated successors called Explicitly Parallel Instruction Computing (EPIC) architectures. We extend the EPIC model with additional capabilities to reconfigure the datapath at runtime in terms of the number and types of functional units...

Krishna V. Palem; Surendranath Talla; Patrick W. Devaney

1999-01-01

435

Morphology-based processor for real-time enhancement  

NASA Astrophysics Data System (ADS)

This report describes an image enhancement processor which uses mathematical morphology to improve contrast. Our goal is the realtime enhancement of cardiac angiograms and the near-realtime enhancement of peripheral angiograms (e.g., arms and legs). It consists of two systolic arrays for morphological processing and an image combination unit for simple pixel by pixel arithmetic and logical operations. The arrays and image combination unit will be integrated into a standard framework (Datacube MAXbus) for image capture, storage, and display. The rest of this report is structured in the following way: The requirements, basic enhancement technique and processor structure are given in the next section. Then the design of the systolic arrays, image combination, and delay units are presented followed by a discussion of the Datacube image processing framework. An organization level simulation was developed to validate the systolic design and results are summarized. Finally, conclusions are made and open issues are discussed.

Drongowski, Paul J.; Andress, Keith M.

1992-11-01

436

PARALLEL STRINGS - PARALLEL UNIVERSES  

Microsoft Academic Search

Sometimes different parts of the battery community just don't seem to operate on the same level, and attitudes towards parallel battery strings are a prime example of this. Engineers at telephone company central offices are quite happy operating 20 or more parallel strings on the same dc bus, while many manufacturers warn against connecting more than four or five strings

Jim McDowall; Saft America

437

Design of free space interconnected signal processor  

NASA Astrophysics Data System (ADS)

Progress is described on a collaborative effort between the Photonics Center at Rome Laboratory (RL), Griffiss AFB and Rutgers University, through the RL Expert Science and Engineering (ES&E) program. The goal of the effort is to develop a prototype random access memory (RAM) that can be used in a signal processor for a computing model that consists of cascaded arrays of optical logic gates interconnected in free space with regular patterns. The effort involved the optical and architectural development of a cascadable optical logic system in which microlaser pumped S-SEED devices serve as logic gates. At the completion of the contract, two gate-level layouts of the module were completed which were created in collaboration with RL personnel. The basic layout of the optical system has been developed, and key components have been tested. The delayed delivery of microlaser arrays precluded completion of the processor during the contract period, but preliminary testing was made possible through the use of other microlaser devices.

Murdocca, Miles; Stone, Thomas

1993-12-01

438

Coarray Fortran for parallel programming  

Microsoft Academic Search

Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran 95 is extended with

Robert W. Numrich; John Reid

1998-01-01

439

Efficient parallel algorithm for hierarchical block-matching motion estimation  

NASA Astrophysics Data System (ADS)

Motion estimation is an integral part of most of the video coding schemes that have been proposed in the literature. It is also the most computationally intensive part in these schemes and thus is usually implemented on high performance parallel architectures. In this paper, we deal with a multiresolution (hierarchical) block matching motion estimation algorithm. Specifically, we parallelize this algorithm on a hypercube based multiprocessor. As this algorithm presents a non-regular data flow, it could not be easily implemented on systolic arrays. In contrast, the use of such an advanced network as the hypercube overcomes the problem of the non-regular data flow, thereby providing high performance. Another important point in our study is that our multiprocessor is assumed to be fine grained, unlike most of the multiprocessors that have been proposed for video coding schemes. The constraint of limited local memory in each processor leads to frequent interprocessor communication and thus the employed techniques should be carefully selected in order to lower the communication overhead. Coarse grained architectures do not have this kind of problem because each processor can take most of the data it will need throughout the algorithm execution from the beginning. This greatly reduces the communication overhead, and thus the algorithm design is rather straightforward in this case.

Konstantopoulos, Charalampos; Svolos, Andreas E.; Kaklamanis, Christos

1998-12-01
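
The block-matching step that the record above parallelizes can be illustrated with a single-resolution full search using the sum of absolute differences (SAD). This is a generic sketch under assumed names, not the paper's hypercube algorithm; a hierarchical version would run a search like this on a downsampled frame pair first and then refine the vector at finer levels.

```python
def sad(reference, current, ref_y, ref_x, cur_y, cur_x, size):
    """Sum of absolute differences between two size-by-size blocks."""
    return sum(
        abs(current[cur_y + dy][cur_x + dx] - reference[ref_y + dy][ref_x + dx])
        for dy in range(size)
        for dx in range(size)
    )


def best_motion_vector(reference, current, y, x, size, search_range):
    """Full search around (y, x); returns ((dy, dx), cost) of the best match.

    Frames are lists of rows of integer pixel values. Candidate blocks that
    would fall outside the reference frame are skipped.
    """
    height, width = len(reference), len(reference[0])
    best, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(reference, current, ry, rx, y, x, size)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost
```

Every candidate displacement is independent of the others, which is what makes the search a natural target for parallel hardware.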

440

Efficient Breadth-First Search on the Cell/BE Processor  

SciTech Connect

Multi-core processors are a shift of paradigm in computer architecture that promises a dramatic increase in performance. But multi-core processors also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a breadth-first search (BFS) for advanced multi-core processors. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with a low-level implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the Bulk-Synchronous Parallel (BSP) model we have determined an accurate performance model that has guided the implementation and the optimization of our algorithms. To validate our approach, we use a state-of-the-art multicore processor, the Cell Broadband Engine (Cell BE). Our experiments, obtained on a pre-production Cell BE board running at 3.2 GHz, show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. The Cell BE is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, an order of magnitude faster than the MTA-2 multi-threaded processor, and two orders of magnitude faster than a BlueGene/L processor. Index Terms: Multi-core processors, Parallel Computing, Cell Broadband Engine, Parallelization Techniques, Graph Exploration Algorithms, Breadth-First Search, BFS.

Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

2008-10-01
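
The BSP-style structure mentioned in the record above is commonly realized as a level-synchronous BFS, in which each level of the search is one superstep. The following sequential sketch shows only that structure; it contains none of the Cell BE optimizations described in the paper, and the names are illustrative assumptions.

```python
def level_synchronous_bfs(adjacency, source):
    """Return a dict mapping each reachable vertex to its BFS level.

    adjacency maps a vertex to an iterable of its neighbors. Each pass of the
    while loop corresponds to one superstep: the current frontier is expanded,
    then swapped for the next one at what would be a barrier in a parallel run.
    """
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a parallel version the frontier would be partitioned across cores.
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in levels:
                    levels[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier  # barrier between supersteps
    return levels
```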

441

A Phase Preserving Sar Processor  

Microsoft Academic Search

Synthetic aperture radar (SAR) image phase information is necessary to support many advanced SAR applications. The phase information in the complex image for conventional range-Doppler processors is not a robust estimate of scene phase. A SAR processor specifically designed to preserve phase information is being developed at the Canada Centre for Remote Sensing (CCRS). In addition

R. Keith Raney; Paris W. Vachon

1989-01-01

442

Never Trust Your Word Processor  

ERIC Educational Resources Information Center

In this article, the author talks about the auto correction mode of word processors that leads to a number of problems and describes an example in biochemistry exams that shows how word processors can lead to mistakes in databases and in papers. The author contends that, where this system is applied, spell checking should not be left to a word…

Linke, Dirk

2009-01-01

443

Never Trust Your Word Processor  

ERIC Educational Resources Information Center

In this article, the author talks about the auto correction mode of word processors that leads to a number of problems and describes an example in biochemistry exams that shows how word processors can lead to mistakes in databases and in papers. The author contends that, where this system is applied, spell checking should not be left to a word…

Linke, Dirk

2009-01-01

444

Parallel superconvergent multigrid  

SciTech Connect

We describe a class of multiscale algorithms for the solution of large sparse linear systems that are particularly well adapted to massively parallel supercomputers. While standard multigrid algorithms are unable to effectively use all processors when computing on coarse grids, the new algorithms utilize the same number of processors at all times. The basic idea is to solve many coarse scale problems simultaneously, combining the results in an optimal way to provide an improved fine scale solution. As a result, convergence rates are much faster than for standard multigrid methods - we have obtained V-cycle convergence rates as good as .0046 with one smoothing application per cycle, and .0013 with two smoothings. On massively parallel machines the improved convergence rate is attained at no extra computational cost since processors that would otherwise be sitting idle are utilized to provide the better convergence. On serial machines the algorithm is slower because of the extra time spent on multiple coarse scales, though in certain cases the improved convergence rate may justify this - particularly in cases where other methods do not converge. In constant coefficient situations the algorithm is easily analyzed theoretically using Fourier methods on a single grid. The fact that only one grid is involved substantially simplifies convergence proofs. A feature of the algorithms is the use of a matched pair of operators: an approximate inverse for smoothing and a superinterpolation operator to move the correction from coarse to fine scales, chosen to optimize the rate of convergence.

Frederickson, P.O.; McBryan, O.A.

1987-01-01
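
To give a feel for the V-cycle convergence rates quoted in the record above, the sketch below counts how many cycles are needed to shrink the error by a chosen factor, assuming the error is multiplied by the same rate on every cycle. The target reduction of 1e-10 is an arbitrary assumption made for illustration, not a figure from the paper.

```python
import math


def cycles_needed(convergence_rate: float, target_reduction: float) -> int:
    """Smallest k with convergence_rate**k <= target_reduction.

    Assumes a constant per-cycle error reduction factor in (0, 1).
    """
    if not 0.0 < convergence_rate < 1.0:
        raise ValueError("convergence_rate must lie in (0, 1)")
    if not 0.0 < target_reduction < 1.0:
        raise ValueError("target_reduction must lie in (0, 1)")
    return math.ceil(math.log(target_reduction) / math.log(convergence_rate))


if __name__ == "__main__":
    # Rates quoted in the abstract for one and two smoothings per cycle.
    for rate in (0.0046, 0.0013):
        print(rate, cycles_needed(rate, 1e-10))
```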

445

Efficient Spare Allocation for Reconfigurable Arrays  

Microsoft Academic Search

Yield degradation from physical failures in large memories and processor arrays is of significant concern to semiconductor manufacturers. One method of increasing the yield for iterated arrays of memory cells or processing elements is to incorporate spare rows and columns in the die or wafer. These spare rows and columns can then be programmed into the array. The authors discuss

Sy-Yen Kuo; W. K. Fuchs

1987-01-01

446

HSRA: high-speed, hierarchical synchronous reconfigurable array  

Microsoft Academic Search

There is no inherent characteristic forcing Field Programmable Gate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide

William Tsu; Kip Macy; Atul Joshi; Randy Huang; Norman Walker; Tony Tung; Omid Rowhani; Varghese George; John Wawrzynek; André DeHon

1999-01-01

447

Reconfigurable pipelined processor  

SciTech Connect

This patent describes a reconfigurable pipelined processor for processing data. It comprises: a plurality of memory devices for storing bits of data; a plurality of arithmetic units for performing arithmetic functions with the data; cross bar means for connecting the memory devices with the arithmetic units for transferring data therebetween; at least one counter connected with the cross bar means for providing a source of addresses to the memory devices; at least one variable tick delay device connected with each of the memory devices and arithmetic units; and means for providing control bits to the variable tick delay device for variably controlling the input and output operations thereof to selectively delay the memory devices and arithmetic units to align the data for processing in a selected sequence.

Saccardi, R.J.

1989-09-19

448

Computerized audio processor  

NASA Astrophysics Data System (ADS)

The Computerized Audio Processor (CAP) is a computer-synthesized electronic filter that removes interference from received or recorded speech signals. The CAP automatically detects and attenuates impulse sounds and tones (e.g., ignition noise, switching transients, whistles, chirps, hum, buzzes, FSK telegraphy, etc.). It also attenuates wideband random noise. All operations of the CAP are fully automatic. Input signals are processed in real time, with a maximum lag of 340 msec. The CAP implements three proven signal processing techniques. One of these (IMP) virtually eliminates most loud impulse noises. A second technique (DSS) automatically detects tones and attenuates them by up to 46 dB. The third technique (INTEL) provides up to 18 dB attenuation of wideband random noise.

Weiss, M. R.; Aschkenasy, E.

1983-05-01

449

The Honeywell Experimental Distributed Processor – an Overview  

Microsoft Academic Search

The Honeywell Experimental Distributed Processor (HXDP) is a vehicle for research in the science and engineering of processor interconnection, executive control, and user software for a certain class of multiple-processor computers which we call \\

E. D. Jensen

1978-01-01

450

Parallel processing architecture for H.264 deblocking filter on multi-core platforms  

NASA Astrophysics Data System (ADS)

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub blocks, and pixel row level are examined in this work. The deblocking architecture consists of a basic cell called the deblocking filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs; the DFM serves the data required by the different numbers of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.

Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

2012-02-01

451

Highly scalable linear solvers on thousands of processors.  

SciTech Connect

In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

2009-09-01

452

A Parallel Processing Algorithm for Gravity Inversion  

NASA Astrophysics Data System (ADS)

The paper presents results of using MPI parallel processing for the 3D inversion of gravity anomalies. The work is done under the FP7 project HP-SEE (http://www.hp-see.eu/). The inversion of geophysical anomalies remains a challenge, and the use of parallel processing can be a tool to achieve better results, "compensating" the complexity of the ill-posed problem of inversion with the increase of volume of calculations. We considered gravity as the simplest case of physical fields and experimented with an algorithm based on the methodology known as CLEAN, developed by Högbom in 1974. The 3D geosection was discretized in finite cuboid elements and represented by a 3D array of nodes, while the ground surface where the anomaly is observed was represented as a 2D array of points. Starting from a geosection with mass density zero in all nodes, the algorithm iteratively identifies the 3D node whose anomaly shape best approximates the observed anomaly in the least squares sense; the mass density in the best 3D node is modified by a prefixed density step and the related effect is subtracted from the observed anomaly; the process continues until some criterion is fulfilled. The theoretical complexity of the algorithm was evaluated on the basis of iterations and run-time for a geosection discretized at different scales. We considered the average number N of nodes in one edge of the 3D array. The order of the number of iterations was evaluated as O(N^3), and the order of run-time as O(N^8). We used several different methods for the identification of the 3D node whose effect offers the best least squares error in approximating the observed anomaly: unweighted least squares error for the whole 2D array of anomalous points; weighting least squares error by the inverted value of observed anomaly over each 3D node; and limiting the area of 2D anomalous points where least squares are calculated over shallow 3D nodes.
By comparing results from the inversion of single-body and two-body geosections, it was concluded that limitation of weighted least squares error gave better results in all cases, in the range of 3% - 6%. The geosection typically used was 4000m*4000m*2000m, discretized with 11x11x6, 21x21x11 and 41x41x21 3D nodes. Bodies were represented by vertical prisms with section 400m*400m and different heights. The run-time for the single-body geosection was up to several hours on a single-processor computer for the geosection with 41x41x21 nodes. Parallel processing with OpenMP and MPI was used for geosections of 81x81x41 nodes (using finite cuboid elements with edge size 50m) on parallel systems of the Bulgarian Academy of Sciences and of the Super Computing Center of NIIFI in Hungary. Using up to 1,000 processors the run-time was about 24 hours, and it was estimated that for a 3D array of 161x161x81 nodes (cuboids with edge 25m) the run time on 1,000 cores would be up to one year. The quality of the inverted geosections was good for single-body models: the algorithm offered clear contrast between the mass density of the body and the environment, and the shapes of the original and inverted prisms were quite similar. In two-body cases better solutions were obtained for shallow bodies; with depth, the tendency of the algorithm was to delineate only the shallow tops of prisms and compensate with a single mass at depth. The algorithm was also tested with two real cases of typical gravity anomalies observed in the Albanides.

Frasheri, Neki; Bushati, Salvatore; Frasheri, Alfred

2013-04-01
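
The iterative scheme described in the gravity-inversion record above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' code: the point-mass `kernel`, the unweighted least squares criterion, the stopping rule (stop when no single density step lowers the error), and all function and parameter names are hypothetical choices made for the example.

```python
def kernel(node, point):
    """Vertical attraction at a surface point from a unit point mass at a node.

    Assumption: point-mass approximation in arbitrary units; node = (x, y, depth)
    with depth > 0, point = (x, y) on the ground surface.
    """
    (nx, ny, nz), (px, py) = node, point
    r2 = (nx - px) ** 2 + (ny - py) ** 2 + nz ** 2
    return nz / r2 ** 1.5


def clean_inversion(observed, points, nodes, step, max_iter=1000):
    """Greedy CLEAN-style inversion: add one density step per iteration."""
    residual = list(observed)
    density = {node: 0.0 for node in nodes}
    # Effect on every surface point of one density step placed at each node.
    effects = {node: [step * kernel(node, p) for p in points] for node in nodes}
    err = sum(r * r for r in residual)
    for _ in range(max_iter):
        best_node, best_err = None, err
        # This search over nodes is the expensive part; each candidate is
        # independent, so it is the natural place to parallelize.
        for node, eff in effects.items():
            e = sum((r - g) ** 2 for r, g in zip(residual, eff))
            if e < best_err:
                best_node, best_err = node, e
        if best_node is None:  # assumed criterion: no step reduces the error
            break
        density[best_node] += step
        residual = [r - g for r, g in zip(residual, effects[best_node])]
        err = best_err
    return density, err
```

On the run-time figures quoted in the abstract: if run-time grows as O(N^8), doubling the number of nodes per edge (81 to 161) multiplies it by roughly 2^8 = 256, so a 24-hour run becomes on the order of 250 days, which is consistent with the "up to one year" estimate.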

453

Parallel PSO using MapReduce  

Microsoft Academic Search

In optimization problems involving large amounts of data, such as web content, commercial transaction information, or bioinformatics data, individual function evaluations may take minutes or even hours. Particle swarm optimization (PSO) must be parallelized for such functions. However, large-scale parallel programs must communicate efficiently, balance work across all processors, and address problems such as failed nodes. We present MapReduce particle

Andrew W. Mcnabb; Christopher K. Monson; Kevin D. Seppi

2007-01-01
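
The idea in the PSO record above, splitting particle swarm optimization into independent per-particle work plus a combining step, can be sketched with Python's built-in `map` and `functools.reduce`. This is only an illustration of the map/reduce shape, assuming a global-best topology and a toy `sphere` objective; the coefficients `w`, `c1`, `c2` and all names are hypothetical, and it is not the formulation from the truncated abstract.

```python
import random
from functools import reduce


def sphere(x):
    """Toy objective (assumed for illustration): minimum 0 at the origin."""
    return sum(v * v for v in x)


def new_particle(rng, dim, lo, hi, f):
    pos = [rng.uniform(lo, hi) for _ in range(dim)]
    return {"pos": pos, "vel": [0.0] * dim, "best_pos": pos, "best_val": f(pos),
            "rng": random.Random(rng.random())}  # per-particle random stream


def move(p, gbest_pos, f, w=0.7, c1=1.5, c2=1.5):
    """Map task: update one particle and re-evaluate the objective."""
    rng = p["rng"]
    vel = [w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
           for x, v, pb, gb in zip(p["pos"], p["vel"], p["best_pos"], gbest_pos)]
    pos = [x + v for x, v in zip(p["pos"], vel)]
    val = f(pos)
    if val < p["best_val"]:
        best_pos, best_val = pos, val
    else:
        best_pos, best_val = p["best_pos"], p["best_val"]
    return {"pos": pos, "vel": vel, "best_pos": best_pos,
            "best_val": best_val, "rng": rng}


def better(a, b):
    """Reduce task: keep whichever particle has the lower personal best."""
    return a if a["best_val"] <= b["best_val"] else b


def pso(f, dim=2, n_particles=20, iterations=50, lo=-5.0, hi=5.0, seed=0):
    rng = random.Random(seed)
    swarm = [new_particle(rng, dim, lo, hi, f) for _ in range(n_particles)]
    gbest = reduce(better, swarm)
    for _ in range(iterations):
        # Map phase: the expensive evaluations are independent per particle.
        swarm = list(map(lambda p: move(p, gbest["best_pos"], f), swarm))
        # Reduce phase: combine personal bests into the new global best.
        gbest = reduce(better, swarm)
    return gbest["best_pos"], gbest["best_val"]
```

Because `move` touches only its own particle and the broadcast global best, the `map` call could in principle be handed to a distributed framework without changing the per-particle logic; that substitution is not shown here.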

454

Parallel Tiled QR Factorization for Multicore Architectures  

Microsoft Academic Search

As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of

Alfredo Buttari; Julien Langou; Jakub Kurzak; Jack Dongarra

2007-01-01

455

Parallel supercomputing today and the cedar approach  

Microsoft Academic Search

More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical;

D. J. Kuck; E. S. Davidson; D. H. Lawrie; A. H. Sameh

1986-01-01

456

Optically smart active antenna arrays  

Microsoft Academic Search

A prototype X-band active antenna array with adaptive optical processing is presented. The optical processor, referred to as an auto-tuning filter, is able to extract the strongest principal component in a two-signal space with up to 30 dB enhancement with respect to the other signals. The processor is compact (8 cm by 4 cm) and scalable to a large number

Dana Z. Anderson; V. Damiao; Edeline Fotheringham; Darko Popovic; Stefania Romisch; Zoya Popovic

2000-01-01

457

Systolic Array Synthesis: Computability and Time Cones.  

National Technical Information Service (NTIS)

Many important algorithms in signal and image processing, speech and pattern recognition or matrix computations consist of coupled systems of recurrence equations. Systolic arrays are regular networks of tightly coupled simple processors with limited stor...

J. M. Delosme; I. C. Ipsen

1986-01-01

458

Tomographic Image Reconstruction Using Systolic Array Algorithms.  

National Technical Information Service (NTIS)

Image reconstruction for Computed Tomography (CT) is a time consuming operation on current uniprocessor computers and even on array processors. This is particularly true for three-dimensional data sets or for limited-data reconstructions requiring iterati...

S. G. Azevedo; A. J. DeGroot; D. J. Schneberk; J. M. Brase; H. E. Martz

1988-01-01

459

VCSEL-based parallel optical transmission module  

Microsoft Academic Search

This paper describes the design process and performance of the optimized parallel optical transmission module. Based on 1×12 VCSEL (Vertical Cavity Surface Emitting Laser) array, we designed and fabricated the high speed parallel optical modules. Our parallel optical module contains a 1×12 VCSEL array, a 12 channel CMOS laser driver circuit, a high speed PCB (Printed Circuit Board), a MT

Rongxuan Shen; Hongda Chen; Chao Zuo; Weihua Pei; Yi Zhou; Jun Tang

2005-01-01

460

SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processors with scanning CCD photodetectors  

NASA Astrophysics Data System (ADS)

Two new types of optoelectronic radio-signal processors were investigated. Charge-coupled device (CCD) photodetectors are used in these processors under continuous scanning conditions, i.e. in a time delay and storage mode. One of these processors is based on a CCD photodetector array with a reference-signal amplitude transparency and the other is an adaptive acousto-optical signal processor with linear frequency modulation. The processor with the transparency performs multichannel discrete-analogue convolution of an input signal with a corresponding kernel of the transformation determined by the transparency. If a light source is an array of light-emitting diodes of special (stripe) geometry, the optical stages of the processor can be made from optical fibre components and the whole processor then becomes a rigid 'sandwich' (a compact hybrid optoelectronic microcircuit). A report is given also of a study of a prototype processor with optical fibre components for the reception of signals from a system with antenna aperture synthesis, which forms a radio image of the Earth.

Esepkina, N. A.; Lavrov, A. P.; Anan'ev, M. N.; Blagodarnyi, V. S.; Ivanov, S. I.; Mansyrev, M. I.; Molodyakov, S. A.

1995-10-01

461

Proof of concept of regional scale hydrologic simulations at hydrologic resolution utilizing massively parallel computer resources  

Microsoft Academic Search

We present the results of a unique, parallel scaling study using a 3-D variably saturated flow problem including land surface processes that ranges from a single processor to a maximum number of 16,384 processors. In the applied finite difference framework and for a fixed problem size per processor, this results in a maximum number of approximately 8 × 10^9 grid

Stefan J. Kollet; Reed M. Maxwell; Carol S. Woodward; Steve Smith; Jan Vanderborght; Harry Vereecken; Clemens Simmer

2010-01-01

462

Ultra-Wideband Direction Finding Using a Fiber Optic Beamforming Processor.  

National Technical Information Service (NTIS)

This paper describes a wideband electro-optic direction finding (DF) processor employing an array of laser diodes, an array of photodetectors, and a network of fiber optic delay lines. This DF filter offers a potential operational bandwidth in excess of 1...

S. A. Pappert

1988-01-01

463

Parallel Mapping Approaches for GNUMAP  

PubMed Central

Mapping short next-generation reads to reference genomes is an important element in SNP calling and expression studies. A major limitation to large-scale whole-genome mapping is the large memory requirements for the algorithm and the long run-time necessary for accurate studies. Several parallel implementations have been performed to distribute memory on different processors and to equally share the processing requirements. These approaches are compared with respect to their memory footprint, load balancing, and accuracy. When using MPI with multi-threading, linear speedup can be achieved for up to 256 processors.

Clement, Nathan L.; Clement, Mark J.; Snell, Quinn; Johnson, W. Evan

2013-01-01

464

Performance of a parallel bispectrum estimation code  

NASA Astrophysics Data System (ADS)

We compare the performance of three parallel supercomputers executing a bispectrum estimation code used to remove distortions from astronomical data. We discuss the issues in parallelizing the code on an 8-processor shared-memory CRAY Y-MP and a 1024-processor distributed-memory nCUBE machine. Results show that elapsed times on the nCUBE machine are comparable to those on the CRAY Y-MP. Execution on the nCUBE was more than 40 times faster than that on a single-processor CRAY-2, resulting in more than 50 times better cost performance. Cost performance on the nCUBE is more than 25 times better than on an 8-processor CRAY Y-MP.

Carmona, Edward A.; Matson, Charles L.

1991-12-01

465

Benchmarking NWP Kernels on Multi- and Many-core Processors  

NASA Astrophysics Data System (ADS)

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.

Michalakes, J.; Vachharajani, M.

2008-12-01

466

Implementation of an ADI Method on parallel computers  

Microsoft Academic Search

In this paper we discuss the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The

Raad A. Fatoohi; Chester E. Grosch

1987-01-01
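
For readers unfamiliar with the ADI (alternating direction implicit) method named in the record above, the following is a minimal sketch of one Peaceman-Rachford step for the 2D diffusion equation on a square grid. The grid layout, zero Dirichlet boundaries, the parameter `r = D*dt/(2*h^2)`, and the function names are assumptions made for illustration; the abstract does not say which ADI variant or boundary conditions the authors used.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (a: sub-, b: main, c: super-diagonal).

    a[0] and c[-1] are ignored. No pivoting, so the matrix is assumed
    diagonally dominant, which holds for the ADI systems below.
    """
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def adi_step(u, r):
    """Advance the n x n interior grid u by one time step (two half steps)."""
    n = len(u)
    a, b, c = [-r] * n, [1 + 2 * r] * n, [-r] * n

    def get(g, i, j):
        # Values outside the interior are the zero Dirichlet boundary.
        return g[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    # Half step 1: implicit along i, explicit along j.
    half = [[0.0] * n for _ in range(n)]
    for j in range(n):  # each j is an independent tridiagonal solve
        rhs = [(1 - 2 * r) * u[i][j] + r * (get(u, i, j - 1) + get(u, i, j + 1))
               for i in range(n)]
        col = thomas(a, b, c, rhs)
        for i in range(n):
            half[i][j] = col[i]

    # Half step 2: implicit along j, explicit along i.
    new = []
    for i in range(n):  # each i is an independent tridiagonal solve
        rhs = [(1 - 2 * r) * half[i][j]
               + r * (get(half, i - 1, j) + get(half, i + 1, j))
               for j in range(n)]
        new.append(thomas(a, b, c, rhs))
    return new
```

Within each half step the tridiagonal solves along different grid lines do not depend on one another, so they can be assigned to different processors; the change of direction between half steps is where data has to be exchanged or transposed.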

467

An energy saving strategy based on adaptive loop parallelization  

Microsoft Academic Search

In this paper, we evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using different number of processors if doing so is beneficial) and measure the potential energy savings when unused processors during execution of a nested loop in a multi-processor on-a-chip (MPoC) are shut down (i.e., placed into a power-down or sleep

Ismail Kadayif; Mahmut T. Kandemir; M. Karakoy

2002-01-01

468

A Programming Model for Massive Data Parallelism with Data Dependencies  

Microsoft Academic Search

Accelerating processors can often be more cost and energy effective for a wide range of data-parallel computing problems than general-purpose processors. For graphics processor units (GPUs), this is particularly the case when program development is aided by environments such as NVIDIA's Compute Unified Device Architecture (CUDA), which dramatically reduces the gap between domain-specific architectures and general purpose programming. Nonetheless,

Xiaohui Cui; Frank Mueller; Thomas E Potok; Yongpeng Zhang

2009-01-01

469

Green Secure Processors: Towards Power-Efficient Secure Processor Design  

NASA Astrophysics Data System (ADS)

With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.

Chhabra, Siddhartha; Solihin, Yan

470

Response matrix transport calculations on parallel computers  

SciTech Connect

The response matrix method offers an excellent vehicle for adapting three-dimensional neutron transport methods to parallel computers. Our current thrust is in utilizing the three-dimensional variational nodal code VARIANT as a point of departure for performing three-dimensional parallel computations on the IBM SPx at Argonne National Laboratory. The code employs a planar red-black iteration with a secondary red-black or four-color iteration within each plane. Speedup and efficiency results have been obtained with a two-stage parallel implementation. First, the response matrix coefficients are calculated in parallel for each unique node type. Second, parallel iterations are performed with one red-black pair of planes assigned to each processor. A hierarchical structure may be employed to obtain finer parallel granularity by assigning multiple processors to the planar red-black or four-color iterations.

Hanebutte, U.R.; Palmiotti, G.; Khalil, H.S. [Argonne National Lab., IL (United States). Reactor Analysis Div.; Tatsumi, M. [Nuclear Fuel Industries, Ltd., Osaka (Japan). Nuclear Engineering Div.; Lewis, E.E. [Northwestern Univ., Evanston, IL (United States). Dept. of Mechanical Engineering

1996-12-31
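
The red-black ordering mentioned in the record above can be illustrated generically. The sketch below applies it to a 5-point Laplace stencil, which is an assumption chosen only because it is the simplest setting; it is not the response matrix sweep used in VARIANT, and the function name is hypothetical.

```python
def red_black_sweep(u):
    """One red-black Gauss-Seidel sweep on a 2D grid, updated in place.

    u includes its boundary rows and columns, which are held fixed.
    Cells are colored like a checkerboard by the parity of (i + j).
    """
    n, m = len(u), len(u[0])
    for color in (0, 1):  # 0 = "red" cells, then 1 = "black" cells
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 == color:
                    # All four neighbours have the opposite color, so every
                    # cell of the current color can be updated concurrently.
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])
    return u
```

The point of the coloring is that an update of one color reads only values of the other color, so each color forms a batch of independent work that can be spread across processors.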

471

SIMD-parallel understanding of natural language with application to magnitude-only optical parsing of text  

NASA Astrophysics Data System (ADS)

A novel parallel model of natural language (NL) understanding is presented which can realize high levels of semantic abstraction, and is designed for implementation on synchronous SIMD architectures and optical processors. Theory is expressed in terms of the Image Algebra (IA), a rigorous, concise, inherently parallel notation which unifies the design, analysis, and implementation of image processing algorithms. The IA has been implemented on numerous parallel architectures, and IA preprocessors and interpreters are available for the FORTRAN and Ada languages. In a previous study, we demonstrated the utility of IA for mapping MEA-conformable (Multiple Execution Array) algorithms to optical architectures. In this study, we extend our previous theory to map serial parsing algorithms to the synchronous SIMD paradigm. We initially derive a two-dimensional image that is based upon the adjacency matrix of a semantic graph. Via IA template mappings, the operations of bottom-up parsing, semantic disambiguation, and referential resolution are implemented as image-processing operations upon the adjacency matrix. Pixel-level operations are constrained to Hadamard addition and multiplication, thresholding, and row/column summation, which are available in magnitude-only optics. Assuming high parallelism in the parse rule base, the parsing of n input symbols with a grammar consisting of M rules of arity H, on an N-processor architecture, could exhibit time complexity of T(n) parallelism, the computational cost is constant and of order H. Since H << n is typical, we claim a fundamental complexity advantage over the current O(n) theoretical time limit of MIMD parsing architectures. Additionally, we show that inference over a semantic net is achievable in parallel in O(m) time, where m corresponds to the depth of the search tree. Results are evaluated in terms of computational cost on SISD and SIMD processors, with discussion of implementation on electro-optic architectures.

Schmalz, Mark S.

1992-08-01

472

Parallel matrix transpose algorithms on distributed memory concurrent computers  

SciTech Connect

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of non-blocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A · B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T · B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

Choi, Jaeyoung [Tennessee Univ., Knoxville, TN (United States); Dongarra, J. [Oak Ridge National Lab., TN (United States)]|[Tennessee Univ., Knoxville, TN (United States); Walker, D.W. [Oak Ridge National Lab., TN (United States)

1994-12-31
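
To make the data movement in the matrix transpose record above concrete, the sketch below computes which blocks each processor must send to which other processor. It assumes that "block scattered" means the usual block-cyclic mapping (block (bi, bj) lives on processor (bi mod P, bj mod Q)) and that the transposed matrix is stored on the same P × Q template; the function names are hypothetical, and the actual message scheduling of the paper's algorithms is not reproduced.

```python
def owner(bi, bj, P, Q):
    """Processor (p, q) holding block (bi, bj) under a block-cyclic layout."""
    return (bi % P, bj % Q)


def transpose_messages(Mb, Nb, P, Q):
    """Group the blocks of an Mb x Nb block matrix A by (source, destination).

    Block (bi, bj) of A becomes block (bj, bi) of A^T, so it moves from the
    owner of (bi, bj) to the owner of (bj, bi). Pairs with source equal to
    destination need no communication, only a local copy.
    """
    msgs = {}
    for bi in range(Mb):
        for bj in range(Nb):
            src = owner(bi, bj, P, Q)
            dst = owner(bj, bi, P, Q)
            msgs.setdefault((src, dst), []).append((bi, bj))
    return msgs
```

For example, `transpose_messages(4, 6, 2, 3)` lists, for every processor pair, the blocks that would travel in one point-to-point message; grouping blocks per pair like this is what lets each processor post its sends to different destinations without waiting on one another.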

473

Automated anomaly detection processor  

NASA Astrophysics Data System (ADS)

Robust exploitation of tracking and surveillance data will provide an early warning and cueing capability for military and civilian Law Enforcement Agency operations. This will improve dynamic tasking of limited resources and hence operational efficiency. The challenge is to rapidly identify threat activity within a huge background of noncombatant traffic. We discuss development of an Automated Anomaly Detection Processor (AADP) that exploits multi-INT, multi-sensor tracking and surveillance data to rapidly identify and characterize events and/or objects of military interest, without requiring operators to specify threat behaviors or templates. The AADP has successfully detected an anomaly in traffic patterns in Los Angeles, analyzed ship track data collected during a Fleet Battle Experiment to detect simulated mine laying behavior amongst maritime noncombatants, and is currently under development for surface vessel tracking within the Coast Guard's Vessel Traffic Service to support port security, ship inspection, and harbor traffic control missions, and to monitor medical surveillance databases for early alert of a bioterrorist attack. The AADP can also be integrated into combat simulations to enhance model fidelity of multi-sensor fusion effects in military operations.

Kraiman, James B.; Arouh, Scott L.; Webb, Michael L.

2002-07-01

474

Efficient parallel solution of linear systems  

Microsoft Academic Search

The most efficient known parallel algorithms for inversion of a nonsingular n × n matrix A or solving a linear system Ax = b over the rationals require O((log n)^2) time and M(n)n^0.5 processors (where M(n) is the number of processors required in order to multiply two n × n rational matrices in time O(log n)). Furthermore, all known polylog

Victor Y. Pan; John H. Reif

1985-01-01

475

Tuning of Kilopixel Transition Edge Sensor Bolometer Arrays with a Digital Frequency Multiplexed Readout System  

NASA Astrophysics Data System (ADS)

A digital frequency multiplexing (DfMUX) system has been developed and used to tune large arrays of transition edge sensor (TES) bolometers read out with SQUID arrays for mm-wavelength cosmology telescopes. The DfMUX system multiplexes the input bias voltages and output currents for several bolometers on a single set of cryogenic wires. Multiplexing reduces the heat load on the camera's sub-Kelvin cryogenic detector stage. In this paper we describe the algorithms and software used to set up and optimize the operation of the bolometric camera. The algorithms are implemented on soft processors embedded within FPGA devices operating on each backend readout board. The result is a fully parallelized implementation for which the setup time is independent of the array size.

MacDermid, Kevin; Hyland, Peter; Aubin, Francois; Bissonnette, Eric; Dobbs, Matt; Hubmayr, Johannes; Smecher, Graeme; Wairrach, Shahjahen

2009-12-01

476

Hardware-modulated parallelism in chip multiprocessors  

Microsoft Academic Search

Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highly-parallel CMPs? This paper presents and evaluates a new approach to highly-parallel CMPs, advocating a new hardware-software contract. The

Julia Chen; Philo Juang; Kevin Ko; Gilberto Contreras; David Penry; Ram Rangan; Adam Stoler; Li-shiuan Peh; Margaret Martonosi

2005-01-01

477

Equalizer: a scalable parallel rendering framework  

Microsoft Academic Search

Continuing improvements in CPU and GPU performance as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and

Stefan Eilemann; Maxim Makhinya; Renato Pajarola

2008-01-01

478

Equalizer: A Scalable Parallel Rendering Framework  

Microsoft Academic Search

Continuing improvements in CPU and GPU performance as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and

Stefan Eilemann; Maxim Makhinya; Renato Pajarola

2009-01-01

479

New Challenges of Parallel Job Scheduling  

Microsoft Academic Search

The workshop on job scheduling strategies for parallel processing (JSSPP) studies the myriad aspects of managing resources on parallel and distributed computers. These studies typically focus on large-scale computing environments, where allocation and management of computing resources present numerous challenges. Traditionally, such systems consisted of massively parallel supercomputers, or more recently, large clusters of commodity processor nodes. These systems are

Eitan Frachtenberg; Uwe Schwiegelshohn

2007-01-01