Algorithms versus architectures for computational chemistry
NASA Technical Reports Server (NTRS)
Partridge, H.; Bauschlicher, C. W., Jr.
1986-01-01
The algorithms employed are computationally intensive and, as a result, increased performance (both algorithmic and architectural) is required to improve accuracy and to treat larger molecular systems. Several benchmark quantum chemistry codes are examined on a variety of architectures. While these codes are only a small portion of a typical quantum chemistry library, they illustrate many of the computationally intensive kernels and data manipulation requirements of some applications. Furthermore, understanding the performance of the existing algorithm on present and proposed supercomputers serves as a guide for future programs and algorithm development. The algorithms investigated are: (1) a sparse symmetric matrix vector product; (2) a four index integral transformation; and (3) the calculation of diatomic two electron Slater integrals. The vectorization strategies are examined for these algorithms for both the Cyber 205 and Cray XMP. In addition, multiprocessor implementations of the algorithms are looked at on the Cray XMP and on the MIT static data flow machine proposed by DENNIS.
Parallel algorithms and architecture for computation of manipulator forward dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallel computation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Communication-efficient parallel architectures and algorithms for image computations
Alnuweiri, H.M.
1989-01-01
The main purpose of this dissertation is the design of efficient parallel techniques for image computations which require global operations on image pixels, as well as the development of parallel architectures with special communication features which can support global data movement efficiently. The class of image problems considered in this dissertation involves global operations on image pixels, and irregular (data-dependent) data movement operations. Such problems include histogramming, component labeling, proximity computations, computing the Hough Transform, computing convexity of regions and related properties such as computing the diameter and a smallest area enclosing rectangle for each region. Images with multiple figures and multiple labeled-sets of pixels are also considered. Efficient solutions to such problems involve integer sorting, graph theoretic techniques, and techniques from computational geometry. Although such solutions are not computationally intensive (they all require O(n{sup 2}) operations to be performed on an n {times} n image), they require global communications. The emphasis here is on developing parallel techniques for data movement, reduction, and distribution, which lead to processor-time optimal solutions for such problems on the proposed organizations. The proposed parallel architectures are based on a memory array which can be viewed as an arrangement of memory modules in a k-dimensional space such that the modules are connected to buses placed parallel to the orthogonal axes of the space, and each bus is connected to one processor or a group of processors. It will be shown that such organizations are communication-efficient and are thus highly suited to the image problems considered here, and also to several other classes of problems. The proposed organizations have p processors and O(n{sup 2}) words of memory to process n {times} n images.
Algorithmic and architectural optimizations for computationally efficient particle filtering.
Sankaranarayanan, Aswin C; Srivastava, Ankur; Chellappa, Rama
2008-05-01
In this paper, we analyze the computational challenges in implementing particle filtering, especially to video sequences. Particle filtering is a technique used for filtering nonlinear dynamical systems driven by non-Gaussian noise processes. It has found widespread applications in detection, navigation, and tracking problems. Although, in general, particle filtering methods yield improved results, it is difficult to achieve real time performance. In this paper, we analyze the computational drawbacks of traditional particle filtering algorithms, and present a method for implementing the particle filter using the Independent Metropolis Hastings sampler, that is highly amenable to pipelined implementations and parallelization. We analyze the implementations of the proposed algorithm, and, in particular, concentrate on implementations that have minimum processing times. It is shown that the design parameters for the fastest implementation can be chosen by solving a set of convex programs. The proposed computational methodology was verified using a cluster of PCs for the application of visual tracking. We demonstrate a linear speed-up of the algorithm using the methodology proposed in the paper. PMID:18390378
A fast algorithm for parallel computation of multibody dynamics on MIMD parallel architectures
NASA Technical Reports Server (NTRS)
Fijany, Amir; Kwan, Gregory; Bagherzadeh, Nader
1993-01-01
In this paper the implementation of a parallel O(LogN) algorithm for computation of rigid multibody dynamics on a Hypercube MIMD parallel architecture is presented. To our knowledge, this is the first algorithm that achieves the time lower bound of O(LogN) by using an optimal number of O(N) processors. However, in addition to its theoretical significance, the algorithm is also highly efficient for practical implementation on commercially available MIMD parallel architectures due to its highly coarse grain size and simple communication and synchronization requirements. We present a multilevel parallel computation strategy for implementation of the algorithm on a Hypercube. This strategy allows the exploitation of parallelism at several computational levels as well as maximum overlapping of computation and communication to increase the performance of parallel computation.
Parallel algorithms and architectures
Albrecht, A.; Jung, H.; Mehlhorn, K.
1987-01-01
Contents of this book are the following: Preparata: Deterministic simulation of idealized parallel computers on more realistic ones; Convex hull of randomly chosen points from a polytope; Dataflow computing; Parallel in sequence; Towards the architecture of an elementary cortical processor; Parallel algorithms and static analysis of parallel programs; Parallel processing of combinatorial search; Communications; An O(nlogn) cost parallel algorithms for the single function coarsest partition problem; Systolic algorithms for computing the visibility polygon and triangulation of a polygonal region; and RELACS - A recursive layout computing system. Parallel linear conflict-free subtree access.
Sanz, J.L.; Dinstein, I.
1987-01-01
In this paper, some image transforms and features such as projections along linear patterns, convex hull approximations, Hough transform for line detection, diameter, moments, and principal components will be considered. Specifically, we present algorithms for computing these features which are suitable for implementation in image analysis pipeline architectures. In particular, random access memories and other dedicated hardware components which may be found in the implementation of classical techniques are not longer needed in our algorithms. The effectiveness of our approach is demonstrated by running some of the new algorithms in conventional short-pipelines for image analysis.
Algorithmically specialized parallel computers
Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.
1985-01-01
This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.
Design And Implementation Of A Multi-Sensor Fusion Algorithm On A Hypercube Computer Architecture
NASA Astrophysics Data System (ADS)
Glover, Charles W.
1990-03-01
A multi-sensor integration (MSI) algorithm written for sequential single processor computer architecture has been transformed into a concurrent algorithm and implemented in parallel on a multi-processor hypercube computer architecture. This paper will present the philosophy and methodologies used in the decomposition of the sequential MSI algorithm, and its transformation into a parallel MSI algorithm. The parallel MSI algorithm was implemented on a NCUBETM hypercube computer. The performance of the parallel MSI algorithm has been measured and compared against its sequential counterpart by running test case scenarios through a simulation program. The simulation program allows the user to define the trajectories of all players in the scenario, and to pick the sensor suites of the players and their operating characteristics. For example, an air-to-air engagement scenario was used as one of the test cases. In this scenario, two friend aircrafts were being attacked by six foe aircraft in a pincer maneuver. Both the friend and foe aircrafts launch missiles at several different time points in the engagement. The sensor suites on each aircraft are dual mode RADAR, dual mode IRST, and ESM sensors. The modes of the sensors are switched as needed throughout the scenario. The RADAR sensor is used only intermittently, thus most of the MSI information is obtained from passive sensing. The maneuvers in this scenario caused aircraft and missile to constantly fly in and out of sensors field-of-view (F0V). This resulted in the MSI algorithm to constantly reacquire, initiate, and delete new tracks as it tracked all objects in the scenario. The objective was to determine performance of the parallel MSI algorithm in such a complex environment, and to determine how many multi-processors (nodes) of the hypercube could be effectively used by an aircraft in such an environment. For the scenario just discussed, a 4-node hypercube was found to be the optimal size and a factor two in speedup
Introduction to the special section on computer architectures and parallel algorithms for PAMI
Dyer, C.R.
1989-03-01
The topic of multiprocessor computer architectures and parallel algorithms for computer vision and related applications is not new, but researchers are now addressing both a wider scope of issues and emphasizing system integration. Recently, a wide variety of different systems have been designed, built, and tested on a range of image understanding tasks. An important goal beginning to be addressed is how to achieve high performance when a complete, integrated set of component vision processes are combined. The papers in this special section describe a number of approaches to improving the performance of vision architectures. Each paper uses a different model of parallel processing. The first four papers describe machines or chips which have been built, each exhibiting certain advantages for vision. One important distinction between these approaches is in terms of the number of processors used, defining the granularity of parallel processing. The first three papers also evaluate the performance of their systems on a suite of vision tasks covering several image representations and processing requirements.
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Youngblood, John N.; Saha, Aindam
1987-01-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Carroll, C.C.; Youngblood, J.N.; Saha, A.
1987-12-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Tiberghien, J.
1984-01-01
This book presents papers on supercomputers. Topics considered include decentralized computer architecture, new programming languages, data flow computers, reduction computers, parallel prefix calculations, structural and behavioral descriptions of digital systems, instruction sets, software generation, personal computing, and computer architecture education.
Parallel Architecture For Robotics Computation
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1990-01-01
Universal Real-Time Robotic Controller and Simulator (URRCS) is highly parallel computing architecture for control and simulation of robot motion. Result of extensive algorithmic study of different kinematic and dynamic computational problems arising in control and simulation of robot motion. Study led to development of class of efficient parallel algorithms for these problems. Represents algorithmically specialized architecture, in sense capable of exploiting common properties of this class of parallel algorithms. System with both MIMD and SIMD capabilities. Regarded as processor attached to bus of external host processor, as part of bus memory.
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-06-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner for scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.
Feng, Lu; Fedrigo, Enrico; Béchet, Clémentine; Brunner, Elisabeth; Pirani, Werther
2012-06-01
The European Southern Observatory (ESO) is studying the next generation giant telescope, called the European Extremely Large Telescope (E-ELT). With a 42 m diameter primary mirror, it is a significant step from currently existing telescopes. Therefore, the E-ELT with its instruments poses new challenges in terms of cost and computational complexity for the control system, including its adaptive optics (AO). Since the conventional matrix-vector multiplication (MVM) method successfully used so far for AO wavefront reconstruction cannot be efficiently scaled to the size of the AO systems on the E-ELT, faster algorithms are needed. Among those recently developed wavefront reconstruction algorithms, three are studied in this paper from the point of view of design, implementation, and absolute speed on three multicore multi-CPU platforms. We focus on a single-conjugate AO system for the E-ELT. The algorithms are the MVM, the Fourier transform reconstructor (FTR), and the fractal iterative method (FRiM). This study enhances the scaling of these algorithms with an increasing number of CPUs involved in the computation. We discuss implementation strategies, depending on various CPU architecture constraints, and we present the first quantitative execution times so far at the E-ELT scale. MVM suffers from a large computational burden, making the current computing platform undersized to reach timings short enough for AO wavefront reconstruction. In our study, the FTR provides currently the fastest reconstruction. FRiM is a recently developed algorithm, and several strategies are investigated and presented here in order to implement it for real-time AO wavefront reconstruction, and to optimize its execution time. The difficulty to parallelize the algorithm in such architecture is enhanced. We also show that FRiM can provide interesting scalability using a sparse matrix approach. PMID:22695596
Introduction to systolic algorithms and architectures
Bentley, J.L.; Kung, H.T.
1983-01-01
The authors survey the class of systolic special-purpose computer architectures and algorithms, which are particularly well-suited for implementation in very large scale integrated circuitry (VLSI). They give a brief introduction to systolic arrays for a reader with a broad technical background and some experience in using a computer, but who is not necessarily a computer scientist. In addition they briefly survey the technological advances in VLSI that led to the development of systolic algorithms and architectures. 38 references.
Parallel processing algorithms for hydrocodes on a computer with MIMD architecture (DENELCOR's HEP)
Hicks, D.L.
1983-11-01
In real time simulation/prediction of complex systems such as water-cooled nuclear reactors, if reactor operators had fast simulator/predictors to check the consequences of their operations before implementing them, events such as the incident at Three Mile Island might be avoided. However, existing simulator/predictors such as RELAP run slower than real time on serial computers. It appears that the only way to overcome the barrier to higher computing rates is to use computers with architectures that allow concurrent computations or parallel processing. The computer architecture with the greatest degree of parallelism is labeled Multiple Instruction Stream, Multiple Data Stream (MIMD). An example of a machine of this type is the HEP computer by DENELCOR. It appears that hydrocodes are very well suited for parallelization on the HEP. It is a straightforward exercise to parallelize explicit, one-dimensional Lagrangean hydrocodes in a zone-by-zone parallelization. Similarly, implicit schemes can be parallelized in a zone-by-zone fashion via an a priori, symbolic inversion of the tridiagonal matrix that arises in an implicit scheme. These techniques are extended to Eulerian hydrocodes by using Harlow's rezone technique. The extension from single-phase Eulerian to two-phase Eulerian is straightforward. This step-by-step extension leads to hydrocodes with zone-by-zone parallelization that are capable of two-phase flow simulation. Extensions to two and three spatial dimensions can be achieved by operator splitting. It appears that a zone-by-zone parallelization is the best way to utilize the capabilities of an MIMD machine. 40 references.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs mit.edu)
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
NASA Astrophysics Data System (ADS)
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
2011-07-01
We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses summed absolute difference error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation, we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized nonfull grid search CPU-based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and simplified unsymmetrical multi-hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720 × 480 pixels in resolution commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Architecture Adaptive Computing Environment
NASA Technical Reports Server (NTRS)
Dorband, John E.
2006-01-01
Architecture Adaptive Computing Environment (aCe) is a software system that includes a language, compiler, and run-time library for parallel computing. aCe was developed to enable programmers to write programs, more easily than was previously possible, for a variety of parallel computing architectures. Heretofore, it has been perceived to be difficult to write parallel programs for parallel computers and more difficult to port the programs to different parallel computing architectures. In contrast, aCe is supportable on all high-performance computing architectures. Currently, it is supported on LINUX clusters. aCe uses parallel programming constructs that facilitate writing of parallel programs. Such constructs were used in single-instruction/multiple-data (SIMD) programming languages of the 1980s, including Parallel Pascal, Parallel Forth, C*, *LISP, and MasPar MPL. In aCe, these constructs are extended and implemented for both SIMD and multiple- instruction/multiple-data (MIMD) architectures. Two new constructs incorporated in aCe are those of (1) scalar and virtual variables and (2) pre-computed paths. The scalar-and-virtual-variables construct increases flexibility in optimizing memory utilization in various architectures. The pre-computed-paths construct enables the compiler to pre-compute part of a communication operation once, rather than computing it every time the communication operation is performed.
Kennington, J.L.; Helgason, R.V.
1989-06-29
One of the most important computer architecture innovations to appear in the marketplace during the last ten years is parallel processing on a shared memory multicomputer. This report presents empirical results on a Sequent Symmetry S81 on four optimization models used in the area of personnel assignment. Both the detailed algorithms and computational results are presented. Contents: Minimal Spanning Trees - An Empirical Investigation of Parallel Algorithms; Dijkstra's Two-tree Shortest Path Algorithm; An Empirical Analysis of the Dense Assignment Problem; and Solving Generalized Network Problems on a Shared Memory Multiprocessor.
Mapping algorithms on regular parallel architectures
Lee, P.
1989-01-01
It is significant that many of time-intensive scientific algorithms are formulated as nested loops, which are inherently regularly structured. In this dissertation the relations between the mathematical structure of nested loop algorithms and the architectural capabilities required for their parallel execution are studied. The architectural model considered in depth is that of an arbitrary dimensional systolic array. The mathematical structure of the algorithm is characterized by classifying its data-dependence vectors according to the new ZERO-ONE-INFINITE property introduced. Using this classification, the first complete set of necessary and sufficient conditions for correct transformation of a nested loop algorithm onto a given systolic array of an arbitrary dimension by means of linear mappings is derived. Practical methods to derive optimal or suboptimal systolic array implementations are also provided. The techniques developed are used constructively to develop families of implementations satisfying various optimization criteria and to design programmable arrays efficiently executing classes of algorithms. In addition, a Computer-Aided Design system running on SUN workstations has been implemented to help in the design. The methodology, which deals with general algorithms, is illustrated by synthesizing linear and planar systolic array algorithms for matrix multiplication, a reindexed Warshall-Floyd transitive closure algorithm, and the longest common subsequence algorithm.
Kennington, J.L.; Helgason, R.V.
1990-10-01
One of the most important computer architecture innovations to appear in the market place during the last ten years is parallel processing on a shared memory multicomputer. This report presents new algorithms for a variety of network models along with empirical analysis on both sequential and parallel computers. An empirical study on the AT and T KORBX system is also presented. This system uses eight processors each of which has vector capability. Our research program objective is to develop and empirically test new parallel algorithms and software for a wide variety of optimization problems. The problems studied this past year include the shortest path problem, the assignment problem, the semi-assignment problem, the transportation problem, and the generalized network problem. Algorithms for all of these models have been developed and empirically tested on a variety of computers. In addition, we worked with the Military Airlift Command to test the ATT KORBX system located at Scott Air Force Base.
The flight telerobotic servicer: From functional architecture to computer architecture
NASA Technical Reports Server (NTRS)
Lumia, Ronald; Fiala, John
1989-01-01
After a brief tutorial on the NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) functional architecture, the approach to its implementation is shown. First, interfaces must be defined which are capable of supporting the known algorithms. This is illustrated by considering the interfaces required for the SERVO level of the NASREM functional architecture. After interface definition, the specific computer architecture for the implementation must be determined. This choice is obviously technology dependent. An example illustrating one possible mapping of the NASREM functional architecture to a particular set of computers which implements it is shown. The result of choosing the NASREM functional architecture is that it provides a technology independent paradigm which can be mapped into a technology dependent implementation capable of evolving with technology in the laboratory and in space.
Neural algorithms on VLSI concurrent architectures
Caviglia, D.D.; Bisio, G.M.; Parodi, G.
1988-09-01
The research concerns the study of neural algorithms for developing CAD tools with A.I. features in VLSI design activities. In this paper the focus is on optimization problems such as partitioning, placement and routing. These problems require massive computational power to be solved (NP-complete problems) and the standard approach is usually based on euristic techniques. Neural algorithms can be represented by a circuital model. This kind of representation can be easily mapped in a real circuit, which, however, features limited flexibility with respect to the variety of problems. In this sense the simulation of the neural circuit, by mapping it on a digital VLSI concurrent architecture seems to be preferrable; in addition this solution offers a wider choice with regard to algorithms characteristics (e.g. transfer curve of neural elements, reconfigurability of interconnections, etc.). The implementation with programmable components, such as transputers, allows an indirect mapping of the algorithm (one transputer for N neurons) accordingly to the dimension and the characteristics of the problem. In this way the neural algorithm described by the circuit is reduced to the algorithm that simulates the network behavior. The convergence properties of that formulation are studied with respect to the characteristics of the neural element transfer curve.
Stereoscopic depth perception for robot vision: algorithms and architectures
Safranek, R.J.; Kak, A.C.
1983-01-01
The implementation of depth perception algorithms for computer vision is considered. In automated manufacturing, depth information is vital for tasks such as path planning and 3-d scene analysis. The presentation begins with a survey of computer algorithms for stereoscopic depth perception. The emphasis is on the Marr-Poggio paradigm of human stereo vision and its computer implementation. In addition, a stereo matching algorithm based on the relaxation labelling technique is examined. A computer architecture designed to efficiently implement stereo matching algorithms, an MIMD array interfaced to a global memory, is presented. 9 references.
Parallel algorithms and architectures for the manipulator inertia matrix
Amin-Javaheri, M.
1989-01-01
Several parallel algorithms and architectures to compute the manipulator inertia matrix in real time are proposed. An O(N) and an O(log{sub 2}N) parallel algorithm based upon recursive computation of the inertial parameters of sets of composite rigid bodies are formulated. One- and two-dimensional systolic architectures are presented to implement the O(N) parallel algorithm. A cube architecture is employed to implement the diagonal element of the inertia matrix in O(log{sub 2}N) time and the upper off-diagonal elements in O(N) time. The resulting K{sub 1}O(N) + K{sub 2}O(log{sub 2}N) parallel algorithm is more efficient for a cube network implementation. All the architectural configurations are based upon a VLSI Robotics Processor exploiting fine-grain parallelism. In evaluation all the architectural configurations, significant performance parameters such as I/O time and idle time due to processor synchronization as well as CPU utilization and on-chip memory size are fully included. The O(N) and O(log{sub 2}N) parallel algorithms adhere to the precedence relationships among the processors. In order to achieve a higher speedup factor; however, parallel algorithms in conjunction with Non-Strict Computational Models are devised to relax interprocess precedence, and as a result, to decrease the effective computational delays. The effectiveness of the Non-strict Computational Algorithms is verified by computer simulations, based on a PUMA 560 robot manipulator. It is demonstrated that a combination of parallel algorithms and architectures results in a very effective approach to achieve real-time response for computing the manipulator inertia matrix.
Recursive computer architecture for VLSI
Treleaven, P.C.; Hopkins, R.P.
1982-01-01
A general-purpose computer architecture based on the concept of recursion and suitable for VLSI computer systems built from replicated (lego-like) computing elements is presented. The recursive computer architecture is defined by presenting a program organisation, a machine organisation and an experimental machine implementation oriented to VLSI. The experimental implementation is being restricted to simple, identical microcomputers each containing a memory, a processor and a communications capability. This future generation of lego-like computer systems are termed fifth generation computers by the Japanese. 30 references.
Computing architecture for autonomous microgrids
Goldsmith, Steven Y.
2015-09-29
A computing architecture that facilitates autonomously controlling operations of a microgrid is described herein. A microgrid network includes numerous computing devices that execute intelligent agents, each of which is assigned to a particular entity (load, source, storage device, or switch) in the microgrid. The intelligent agents can execute in accordance with predefined protocols to collectively perform computations that facilitate uninterrupted control of the .
Chang, C.Y.
1986-01-01
New results on efficient forms of decoding convolutional codes based on Viterbi and stack algorithms using systolic array architecture are presented. Some theoretical aspects of systolic arrays are also investigated. First, systolic array implementation of Viterbi algorithm is considered, and various properties of convolutional codes are derived. A technique called strongly connected trellis decoding is introduced to increase the efficient utilization of all the systolic array processors. The issues dealing with the composite branch metric generation, survivor updating, overall system architecture, throughput rate, and computations overhead ratio are also investigated. Second, the existing stack algorithm is modified and restated in a more concise version so that it can be efficiently implemented by a special type of systolic array called systolic priority queue. Three general schemes of systolic priority queue based on random access memory, shift register, and ripple register are proposed. Finally, a systematic approach is presented to design systolic arrays for certain general classes of recursively formulated algorithms.
Unifying parametrized VLSI Jacobi algorithms and architectures
NASA Astrophysics Data System (ADS)
Deprettere, Ed F. A.; Moonen, Marc
1993-11-01
Implementing Jacobi algorithms in parallel VLSI processor arrays is a non-trivial task, in particular when the algorithms are parametrized with respect to size and the architectures are parametrized with respect to space-time trade-offs. The paper is concerned with an approach to implement several time-adaptive Jacobi-type algorithms on a parallel processor array, using only Cordic arithmetic and asynchronous communications, such that any degree of parallelism, ranging from single-processor up to full-size array implementation, is supported by a `universal' processing unit. This result is attributed to a gracious interplay between algorithmic and architectural engineering.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than line relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. The parallel implementation of a V-cycle multiple semi-coarsened grid (MSG) algorithm or distributed-memory architectures such as the Intel iPSC/860 and Paragon computers is addressed. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. A mapping of an MSG algorithm to distributed-memory architectures that demonstrate how both levels of parallelism can be exploited is described. The results is a robust and effective multigrid algorithm for distributed-memory machines.
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
Optical linear algebra processors - Architectures and algorithms
NASA Technical Reports Server (NTRS)
Casasent, David
1986-01-01
Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.
NASA Astrophysics Data System (ADS)
Basu, S.; Ganguly, S.; Nemani, R. R.; Mukhopadhyay, S.; Milesi, C.; Votava, P.; Michaelis, A.; Zhang, G.; Cook, B. D.; Saatchi, S. S.; Boyda, E.
2014-12-01
Accurate tree cover delineation is a useful instrument in the derivation of Above Ground Biomass (AGB) density estimates from Very High Resolution (VHR) satellite imagery data. Numerous algorithms have been designed to perform tree cover delineation in high to coarse resolution satellite imagery, but most of them do not scale to terabytes of data, typical in these VHR datasets. In this paper, we present an automated probabilistic framework for the segmentation and classification of 1-m VHR data as obtained from the National Agriculture Imagery Program (NAIP) for deriving tree cover estimates for the whole of Continental United States, using a High Performance Computing Architecture. The results from the classification and segmentation algorithms are then consolidated into a structured prediction framework using a discriminative undirected probabilistic graphical model based on Conditional Random Field (CRF), which helps in capturing the higher order contextual dependencies between neighboring pixels. Once the final probability maps are generated, the framework is updated and re-trained by incorporating expert knowledge through the relabeling of misclassified image patches. This leads to a significant improvement in the true positive rates and reduction in false positive rates. The tree cover maps were generated for the state of California, which covers a total of 11,095 NAIP tiles and spans a total geographical area of 163,696 sq. miles. Our framework produced correct detection rates of around 85% for fragmented forests and 70% for urban tree cover areas, with false positive rates lower than 3% for both regions. Comparative studies with the National Land Cover Data (NLCD) algorithm and the LiDAR high-resolution canopy height model shows the effectiveness of our algorithm in generating accurate high-resolution tree cover maps.
Algorithmically Specialized Parallel Architecture For Robotics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1991-01-01
Computing system called Robot Mathematics Processor (RMP) contains large number of processor elements (PE's) connected in various parallel and serial combinations reconfigurable via software. Special-purpose architecture designed for solving diverse computational problems in robot control, simulation, trajectory generation, workspace analysis, and like. System an MIMD-SIMD parallel architecture capable of exploiting parallelism in different forms and at several computational levels. Major advantage lies in design of cells, which provides flexibility and reconfigurability superior to previous SIMD processors.
Savannah River Site computing architecture
Not Available
1991-03-29
A computing architecture is a framework for making decisions about the implementation of computer technology and the supporting infrastructure. Because of the size, diversity, and amount of resources dedicated to computing at the Savannah River Site (SRS), there must be an overall strategic plan that can be followed by the thousands of site personnel who make decisions daily that directly affect the SRS computing environment and impact the site`s production and business systems. This plan must address the following requirements: There must be SRS-wide standards for procurement or development of computing systems (hardware and software). The site computing organizations must develop systems that end users find easy to use. Systems must be put in place to support the primary function of site information workers. The developers of computer systems must be given tools that automate and speed up the development of information systems and applications based on computer technology. This document describes a proposal for a site-wide computing architecture that addresses the above requirements. In summary, this architecture is standards-based data-driven, and workstation-oriented with larger systems being utilized for the delivery of needed information to users in a client-server relationship.
Savannah River Site computing architecture
Not Available
1991-03-29
A computing architecture is a framework for making decisions about the implementation of computer technology and the supporting infrastructure. Because of the size, diversity, and amount of resources dedicated to computing at the Savannah River Site (SRS), there must be an overall strategic plan that can be followed by the thousands of site personnel who make decisions daily that directly affect the SRS computing environment and impact the site's production and business systems. This plan must address the following requirements: There must be SRS-wide standards for procurement or development of computing systems (hardware and software). The site computing organizations must develop systems that end users find easy to use. Systems must be put in place to support the primary function of site information workers. The developers of computer systems must be given tools that automate and speed up the development of information systems and applications based on computer technology. This document describes a proposal for a site-wide computing architecture that addresses the above requirements. In summary, this architecture is standards-based data-driven, and workstation-oriented with larger systems being utilized for the delivery of needed information to users in a client-server relationship.
Nonlinear hierarchical substructural parallelism and computer architecture
NASA Technical Reports Server (NTRS)
Padovan, Joe
1989-01-01
Computer architecture is investigated in conjunction with the algorithmic structures of nonlinear finite-element analysis. To help set the stage for this goal, the development is undertaken by considering the wide-ranging needs associated with the analysis of rolling tires which possess the full range of kinematic, material and boundary condition induced nonlinearity in addition to gross and local cord-matrix material properties.
Acoustic simulation in architecture with parallel algorithm
NASA Astrophysics Data System (ADS)
Li, Xiaohong; Zhang, Xinrong; Li, Dan
2004-03-01
In allusion to complexity of architecture environment and Real-time simulation of architecture acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in scene is solved with this method. And then the impulse response between sources and receivers at frequency segment, which are calculated with multi-process, are combined into whole frequency response. The numerical experiment shows that parallel arithmetic can improve the acoustic simulating efficiency of complex scene.
VLSI Architectures for Computing DFT's
NASA Technical Reports Server (NTRS)
Truong, T. K.; Chang, J. J.; Hsu, I. S.; Reed, I. S.; Pei, D. Y.
1986-01-01
Simplifications result from use of residue Fermat number systems. System of finite arithmetic over residue Fermat number systems enables calculation of discrete Fourier transform (DFT) of series of complex numbers with reduced number of multiplications. Computer architectures based on approach suitable for design of very-large-scale integrated (VLSI) circuits for computing DFT's. General approach not limited to DFT's; Applicable to decoding of error-correcting codes and other transform calculations. System readily implemented in VLSI.
Efficient tree codes on SIMD computer architectures
NASA Astrophysics Data System (ADS)
Olson, Kevin M.
1996-11-01
This paper describes changes made to a previous implementation of an N -body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ~ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.
Neural-network algorithms and architectures for pattern classification
Mao, Weidong.
1991-01-01
The study of the artificial neural networks is an integrated research field that involves the disciplines of applied mathematics, physics, neurobiology, computer science, information, control, parallel processing and VLSI. This dissertation deals with a number of topics from a broad spectrum of neural network research in models, algorithms, applications and VLSI architectures. Specifically, this dissertation is aimed at studying neural network algorithms and architectures for pattern classification tasks. The work presented in this dissertation has a wide range of applications including speech recognition, image recognition, and high level knowledge processing. Supervised neural networks, such as the back-propagation network, can be used for classification tasks as the result of approximating an input/output mapping. They are the approximation-based classifiers. The original gradient descent back propagation learning algorithm exhibits slow convergence speed. Fast algorithms such as the conjugate gradient and quasi-Newton algorithms can be adopted. The main emphasis on neural network classifiers in this dissertation is the competition-based classifiers. Due to the rapid advance in VLSI technology, parallel processing, and computer aided design (CAD), application-specific VLSI systems are becoming more and more powerful and feasible. In particular, VLSI array processors offer high speed and efficiency through their massive parallelism and pipelining, regularity, modularity, and local communication. A unified VLSI array architecture can be used for implementing neural networks and Hidden Markov Models. He also proposes a pipeline interleaving approach to design VLSI array architectures for real-time image and video signal processing.
Fast semivariogram computation using FPGA architectures
NASA Astrophysics Data System (ADS)
Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang
2015-02-01
The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O(n2). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating in the low range of power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. Anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, ostensibly based on refinements to the current modules. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from MRI scans are utilized for the experiments
Scalable computer architecture for digital vascular systems
NASA Astrophysics Data System (ADS)
Goddard, Iain; Chao, Hui; Skalabrin, Mark
1998-06-01
Digital vascular computer systems are used for radiology and fluoroscopy (R/F), angiography, and cardiac applications. In the United States alone, about 26 million procedures of these types are performed annually: about 81% R/F, 11% cardiac, and 8% angiography. Digital vascular systems have a very wide range of performance requirements, especially in terms of data rates. In addition, new features are added over time as they are shown to be clinically efficacious. Application-specific processing modes such as roadmapping, peak opacification, and bolus chasing are particular to some vascular systems. New algorithms continue to be developed and proven, such as Cox and deJager's precise registration methods for masks and live images in digital subtraction angiography. A computer architecture must have high scalability and reconfigurability to meet the needs of this modality. Ideally, the architecture could also serve as the basis for a nonvascular R/F system.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.
1988-01-01
The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
Gropp, William D.
2014-06-23
With the coming end of Moore's law, it has become essential to develop new algorithms and techniques that can provide the performance needed by demanding computational science applications, especially those that are part of the DOE science mission. This work was part of a multi-institution, multi-investigator project that explored several approaches to develop algorithms that would be effective at the extreme scales and with the complex processor architectures that are expected at the end of this decade. The work by this group developed new performance models that have already helped guide the development of highly scalable versions of an algebraic multigrid solver, new programming approaches designed to support numerical algorithms on heterogeneous architectures, and a new, more scalable version of conjugate gradient, an important algorithm in the solution of very large linear systems of equations.
Architectural Implications for Spatial Object Association Algorithms
Kumar, V S; Kurc, T; Saltz, J; Abdulla, G; Kohn, S R; Matarazzo, C
2009-01-29
Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server R, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
Architectural Implications for Spatial Object Association Algorithms*
Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste
2013-01-01
Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
Quadrant architecture for fast in-place algorithms
Besslich, P.W.; Kurowski, J.O.
1983-10-01
The architecture proposed is tailored to support Radix-2/sup k/ based in-place processing of pictorial data. The algorithms make use of signal-flow graphs to describe 2-dimensional in-place operations suitable for image processing. They may be executed on a general-purpose computer but may also be supported by a special parallel architecture. Major advantages of the scheme are in-place processing and parallel access to disjoint sections of memory only. A quadtree-like decomposition of the picture prevents blocking and queuing of private and common buses. 9 references.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Algorithmes et architectures pour ordinateurs quantiques supraconducteurs
NASA Astrophysics Data System (ADS)
Blais, A.
2003-09-01
Algorithms and architectures for superconducting quantum computers Since its formulation, information theory was based, implicitly, on the laws of classical physics. Such a formulation is however incomplete because it does not take into account quantum reality. During the last twenty years, expansion of theory information to include quantum effects has known growing interest. The practical realization of a system for quantum data processing system, a quantum computer, presents however many challenges. In this book, we are interested in various aspects of these challenges. We start by presenting algorithmic concepts like optimization of quantum computations and geometric quantum computation. We then consider various designs and aspects of qubits based on Josephson junctions. In particular, an original approach to the interaction between superconducting qubits is presented. This approach is very general since it can be applied to various designs of qubits. Finally, we are interested in read-out of the superconductic flux qubits. The detector suggested here has the advantage that it is possible to uncouple it from the qubit when no measurement is in progress. Depuis sa formulation, la théorie de l'information a été basée, implicitement, sur les lois de la physique classique. Une telle formulation est toutefois incomplète puisqu'elle ne tient pas compte de la réalité quantique. Au cours des vingt dernières années, l'expansion de la théorie de l'information, de façon à englober les effets purement quantiques, a connu un intérêt grandissant. La réalisation d'un système de traitement de l'information quantique, un ordinateur quantique, présente toutefois de nombreux défis. Dans cet ouvrage, on s'intéresse à différents aspects concernant ces défis. On commence par présenter des concepts algorithmiques comme l'optimisation de calculs quantiques et le calcul quantique géométrique. Par la suite, on s'intéresse à différents designs et aspects de l
NASA Technical Reports Server (NTRS)
Metcalfe, A. G.; Bodenheimer, R. E.
1976-01-01
A parallel algorithm for counting the number of logic-l elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
A heterogeneous hierarchical architecture for real-time computing
Skroch, D.A.; Fornaro, R.J.
1988-12-01
The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchial Architecture for Real-Time (H{sup 2} ART) and system software for program loading and interprocessor communication.
A biconjugate gradient type algorithm on massively parallel architectures
NASA Technical Reports Server (NTRS)
Freund, Roland W.; Hochbruck, Marlis
1991-01-01
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
A memory-array architecture for computer vision
Balsara, P.T.
1989-01-01
With the fast advances in the area of computer vision and robotics there is a growing need for machines that can understand images at a very high speed. A conventional von Neumann computer is not suited for this purpose because it takes a tremendous amount of time to solve most typical image processing problems. Exploiting the inherent parallelism present in various vision tasks can significantly reduce the processing time. Fortunately, parallelism is increasingly affordable as hardware gets cheaper. Thus it is now imperative to study computer vision in a parallel processing framework. The author should first design a computational structure which is well suited for a wide range of vision tasks and then develop parallel algorithms which can run efficiently on this structure. Recent advances in VLSI technology have led to several proposals for parallel architectures for computer vision. In this thesis he demonstrates that a memory array architecture with efficient local and global communication capabilities can be used for high speed execution of a wide range of computer vision tasks. This architecture, called the Access Constrained Memory Array Architecture (ACMAA), is efficient for VLSI implementation because of its modular structure, simple interconnect and limited global control. Several parallel vision algorithms have been designed for this architecture. The choice of vision problems demonstrates the versatility of ACMAA for a wide range of vision tasks. These algorithms were simulated on a high level ACMAA simulator running on the Intel iPSC/2 hypercube, a parallel architecture. The results of this simulation are compared with those of sequential algorithms running on a single hypercube node. Details of the ACMAA processor architecture are also presented.
Computational Controls Workstation: Algorithms and hardware
NASA Technical Reports Server (NTRS)
Venugopal, R.; Kumar, M.
1993-01-01
The Computational Controls Workstation provides an integrated environment for the modeling, simulation, and analysis of Space Station dynamics and control. Using highly efficient computational algorithms combined with a fast parallel processing architecture, the workstation makes real-time simulation of flexible body models of the Space Station possible. A consistent, user-friendly interface and state-of-the-art post-processing options are combined with powerful analysis tools and model databases to provide users with a complete environment for Space Station dynamics and control analysis. The software tools available include a solid modeler, graphical data entry tool, O(n) algorithm-based multi-flexible body simulation, and 2D/3D post-processors. This paper describes the architecture of the workstation while a companion paper describes performance and user perspectives.
Brain architecture: a design for natural computation.
Kaiser, Marcus
2007-12-15
Fifty years ago, John von Neumann compared the architecture of the brain with that of the computers he invented and which are still in use today. In those days, the organization of computers was based on concepts of brain organization. Here, we give an update on current results on the global organization of neural systems. For neural systems, we outline how the spatial and topological architecture of neuronal and cortical networks facilitates robustness against failures, fast processing and balanced network activation. Finally, we discuss mechanisms of self-organization for such architectures. After all, the organization of the brain might again inspire computer architecture. PMID:17855223
Computational Biology, Advanced Scientific Computing, and Emerging Computational Architectures
2007-06-27
This CRADA was established at the start of FY02 with $200 K from IBM and matching funds from DOE to support post-doctoral fellows in collaborative research between International Business Machines and Oak Ridge National Laboratory to explore effective use of emerging petascale computational architectures for the solution of computational biology problems. 'No cost' extensions of the CRADA were negotiated with IBM for FY03 and FY04.
Task scheduling in dataflow computer architectures
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
Dataflow computers provide a platform for the solution of a large class of computational problems, which includes digital signal processing and image processing. Many typical applications are represented by a set of tasks which can be repetitively executed in parallel as specified by an associated dataflow graph. Research in this area aims to model these architectures, develop scheduling procedures, and predict the transient and steady state performance. Researchers at NASA have created a model and developed associated software tools which are capable of analyzing a dataflow graph and predicting its runtime performance under various resource and timing constraints. These models and tools were extended and used in this work. Experiments using these tools revealed certain properties of such graphs that require further study. Specifically, the transient behavior at the beginning of the execution of a graph can have a significant effect on the steady state performance. Transformation and retiming of the application algorithm and its initial conditions can produce a different transient behavior and consequently different steady state performance. The effect of such transformations on the resource requirements or under resource constraints requires extensive study. Task scheduling to obtain maximum performance (based on user-defined criteria), or to satisfy a set of resource constraints, can also be significantly affected by a transformation of the application algorithm. Since task scheduling is performed by heuristic algorithms, further research is needed to determine if new scheduling heuristics can be developed that can exploit such transformations. This work has provided the initial development for further long-term research efforts. A simulation tool was completed to provide insight into the transient and steady state execution of a dataflow graph. A set of scheduling algorithms was completed which can operate in conjunction with the modeling and performance tools
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Topological Code Architectures for Quantum Computation
NASA Astrophysics Data System (ADS)
Cesare, Christopher Anthony
This dissertation is concerned with quantum computation using many-body quantum systems encoded in topological codes. The interest in these topological systems has increased in recent years as devices in the lab begin to reach the fidelities required for performing arbitrarily long quantum algorithms. The most well-studied system, Kitaev's toric code, provides both a physical substrate for performing universal fault-tolerant quantum computations and a useful pedagogical tool for explaining the way other topological codes work. In this dissertation, I first review the necessary formalism for quantum information and quantum stabilizer codes, and then I introduce two families of topological codes: Kitaev's toric code and Bombin's color codes. I then present three chapters of original work. First, I explore the distinctness of encoding schemes in the color codes. Second, I introduce a model of quantum computation based on the toric code that uses adiabatic interpolations between static Hamiltonians with gaps constant in the system size. Lastly, I describe novel state distillation protocols that are naturally suited for topological architectures and show that they provide resource savings in terms of the number of required ancilla states when compared to more traditional approaches to quantum gate approximation.
Fibonacci Numbers and Computer Algorithms.
ERIC Educational Resources Information Center
Atkins, John; Geist, Robert
1987-01-01
The Fibonacci Sequence describes a vast array of phenomena from nature. Computer scientists have discovered and used many algorithms which can be classified as applications of Fibonacci's sequence. In this article, several of these applications are considered. (PK)
Computer algorithm for coding gain
NASA Technical Reports Server (NTRS)
Dodd, E. E.
1974-01-01
Development of a computer algorithm for coding gain for use in an automated communications link design system. Using an empirical formula which defines coding gain as used in space communications engineering, an algorithm is constructed on the basis of available performance data for nonsystematic convolutional encoding with soft-decision (eight-level) Viterbi decoding.
A high performance parallel computing architecture for robust image features
NASA Astrophysics Data System (ADS)
Zhou, Renyan; Liu, Leibo; Wei, Shaojun
2014-03-01
A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.
Teaching Computer Aided Architectural Design at UCLA.
ERIC Educational Resources Information Center
Mitchell, William J.
This brief overview includes a rationale for the program and describes course goals and objectives, curriculum content, teaching methods and materials, staffing, and problems of integrating computer aided design with traditional architectural curricula at the School of Architecture and Urban Planning at UCLA. A list of texts for use in teaching…
SCORPIUS algorithm benchmarks on the image understanding architecture machine
NASA Astrophysics Data System (ADS)
Bogdanowicz, Julius F.; Nash, J. Gregory; Shu, David B.
1992-04-01
Many Hughes tactical and strategic programs need high performance image processing. For example, photo-interpretation applications can require up to four orders of magnitude speedup over conventional computer architectures. Therefore, parallel processing systems are needed to help close the processing gap. Vision applications can usually be decomposed into three levels of processing called high, intermediate, and low level vision. Each processing level typically requires different types of numeric/symbolic computation, processing task granularities, and communications bandwidths. No parallel processing system is commercially available that is optimized for the entire range of computations. To meet these processing challenges, the image understanding architecture (IUA) has been developed by Hughes in collaboration with the University of Massachusetts. The IUA is a heterogeneous, hierarchical, associative parallel processor that is organized in three levels corresponding to the vision problem. Its lowest level consists of a large content addressable array parallel processor. This array of 'per pixel' bit serial processors is used for fixed point, low level numeric, and symbolic computations. The middle level is an interface communications array processor (ICAP). ICAP is an array of digital signal processing chips from TI TMS320Cx line, used for high speed number crunching. The highest level is the symbolic processing array. It is an array of general purpose microprocessors in which the artificial intelligence content of the image understanding software resides. A set of benchmarks from the DARPA/ORD sponsored SCORPIUS program were developed using the IUA. The set of algorithms included low level image processing as well as high level matching algorithms. Benchmark performance on the second generation IUA hardware is over four orders of magnitude faster than equivalent algorithms implemented on a DEC VAX 8650. The first generation hardware is operational. Development
Expandable computed-tomography architecture for nondestructive inspection
NASA Astrophysics Data System (ADS)
Agi, Iskender; Hurst, Paul J.; Current, K. W.
1993-04-01
The Radon transform and its inverse, commonly used for computed tomography (CT), are computationally burdensome for single processor computers. Since projection-based computations are easily executed in parallel, multiprocessor architectures have been proposed for high-speed operation. In this paper, we describe an architecture for a high-speed (30 MHz raster-scan image data rate), high accuracy (12-bits per pixel) computed-tomography system for use in non-destructive inspection system. This architecture reconstructs images from fan- or parallel-beam data using either single-pass or iterative reconstruction techniques. Our architecture uses a number of identical processor modules in a pipeline. Each processor module consists of memory for data storage, a commercially available digital signal processing (DSP) chip for filtering, and our custom IC which performs 450 million mathematical operations per second (MOPS). This architecture can reconstruct CT images as large as 1024 X 1024 pixels from a variety of image reconstruction algorithms. The details of the implementation and performance of our expandable architecture are discussed.
Computed laminography and reconstruction algorithm
NASA Astrophysics Data System (ADS)
Que, Jie-Min; Cao, Da-Quan; Zhao, Wei; Tang, Xiao; Sun, Cui-Li; Wang, Yan-Fang; Wei, Cun-Feng; Shi, Rong-Jian; Wei, Long; Yu, Zhong-Qiang; Yan, Yong-Lian
2012-08-01
Computed laminography (CL) is an alternative to computed tomography if large objects are to be inspected with high resolution. This is especially true for planar objects. In this paper, we set up a new scanning geometry for CL, and study the algebraic reconstruction technique (ART) for CL imaging. We compare the results of ART with variant weighted functions by computer simulation with a digital phantom. It proves that ART algorithm is a good choice for the CL system.
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1991-01-01
The Theory of Intelligent Machines proposes a hierarchical organization for the functions of an autonomous robot based on the Principle of Increasing Precision With Decreasing Intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed in recent years. A computer architecture that implements the lower two levels of the intelligent machine is presented. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Details of Execution Level controllers for motion and vision systems are addressed, as well as the Petri net transducer software used to implement Coordination Level functions. Extensions to UNIX and VxWorks operating systems which enable the development of a heterogeneous, distributed application are described. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1992-01-01
The theory of intelligent machines proposes a hierarchical organization for the functions of an autonomous robot based on the principle of increasing precision with decreasing intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed. The authors present a computer architecture that implements the lower two levels of the intelligent machine. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Execution-level controllers for motion and vision systems are briefly addressed, as well as the Petri net transducer software used to implement coordination-level functions. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
A VLSI architecture for simplified arithmetic Fourier transform algorithm
NASA Technical Reports Server (NTRS)
Reed, Irving S.; Shih, Ming-Tang; Truong, T. K.; Hendon, E.; Tufts, D. W.
1992-01-01
The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed in a previous paper for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain recent AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25 percent of that used in the direct method.
Cluster algorithms and computational complexity
NASA Astrophysics Data System (ADS)
Li, Xuenan
Cluster algorithms for the 2D Ising model with a staggered field have been studied and a new cluster algorithm for path sampling has been worked out. The complexity properties of Bak-Seppen model and the Growing network model have been studied by using the Computational Complexity Theory. The dynamic critical behavior of the two-replica cluster algorithm is studied. Several versions of the algorithm are applied to the two-dimensional, square lattice Ising model with a staggered field. The dynamic exponent for the full algorithm is found to be less than 0.5. It is found that odd translations of one replica with respect to the other together with global flips are essential for obtaining a small value of the dynamic exponent. The path sampling problem for the 1D Ising model is studied using both a local algorithm and a novel cluster algorithm. The local algorithm is extremely inefficient at low temperature, where the integrated autocorrelation time is found to be proportional to the fourth power of correlation length. The dynamic exponent of the cluster algorithm is found to be zero and therefore proved to be much more efficient than the local algorithm. The parallel computational complexity of the Bak-Sneppen evolution model is studied. It is shown that Bak-Sneppen histories can be generated by a massively parallel computer in a time that is polylog in the length of the history, which means that the logical depth of producing a Bak-Sneppen history is exponentially less than the length of the history. The parallel dynamics for generating Bak-Sneppen histories is contrasted to standard Bak-Sneppen dynamics. The parallel computational complexity of the Growing Network model is studied. The growth of the network with linear kernels is shown to be not complex and an algorithm with polylog parallel running time is found. The growth of the network with gamma ≥ 2 super-linear kernels can be realized by a randomized parallel algorithm with polylog expected running time.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1987-01-01
The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
THE COMPUTER AND THE ARCHITECTURAL PROFESSION.
ERIC Educational Resources Information Center
HAVILAND, DAVID S.
THE ROLE OF ADVANCING TECHNOLOGY IN THE FIELD OF ARCHITECTURE IS DISCUSSED IN THIS REPORT. PROBLEMS IN COMMUNICATION AND THE DESIGN PROCESS ARE IDENTIFIED. ADVANTAGES AND DISADVANTAGES OF COMPUTERS ARE MENTIONED IN RELATION TO MAN AND MACHINE INTERACTION. PRESENT AND FUTURE IMPLICATIONS OF COMPUTER USAGE ARE IDENTIFIED AND DISCUSSED WITH RESPECT…
Switching from Computer to Microcomputer Architecture Education
ERIC Educational Resources Information Center
Bolanakis, Dimosthenis E.; Kotsis, Konstantinos T.; Laopoulos, Theodore
2010-01-01
In the last decades, the technological and scientific evolution of the computing discipline has been widely affecting research in software engineering education, which nowadays advocates more enlightened and liberal ideas. This article reviews cross-disciplinary research on a computer architecture class in consideration of its switching to…
FFT Computation with Systolic Arrays, A New Architecture
NASA Technical Reports Server (NTRS)
Boriakoff, Valentin
1994-01-01
The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
A fully programmable computing architecture for medical ultrasound machines.
Schneider, Fabio Kurt; Agarwal, Anup; Yoo, Yang Mo; Fukuoka, Tetsuya; Kim, Yongmin
2010-03-01
Application-specific ICs have been traditionally used to support the high computational and data rate requirements in medical ultrasound systems, particularly in receive beamforming. Utilizing the previously developed efficient front-end algorithms, in this paper, we present a simple programmable computing architecture, consisting of a field-programmable gate array (FPGA) and a digital signal processor (DSP), to support core ultrasound signal processing. It was found that 97.3% and 51.8% of the FPGA and DSP resources are, respectively, needed to support all the front-end and back-end processing for B-mode imaging with 64 channels and 120 scanlines per frame at 30 frames/s. These results indicate that this programmable architecture can meet the requirements of low- and medium-level ultrasound machines while providing a flexible platform for supporting the development and deployment of new algorithms and emerging clinical applications. PMID:19546045
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Som, Sukhamoy; Stoughton, John W.; Mielke, Roland R.
1990-01-01
Performance modeling and performance enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures are discussed. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called algorithm to architecture mapping model (ATAMM). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.; Som, Sukhamony
1990-01-01
The performance modeling and enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures is examined. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called ATAMM (Algorithm To Architecture Mapping Model). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Algorithms Bridging Quantum Computation and Chemistry
NASA Astrophysics Data System (ADS)
McClean, Jarrod Ryan
The design of new materials and chemicals derived entirely from computation has long been a goal of computational chemistry, and the governing equation whose solution would permit this dream is known. Unfortunately, the exact solution to this equation has been far too expensive and clever approximations fail in critical situations. Quantum computers offer a novel solution to this problem. In this work, we develop not only new algorithms to use quantum computers to study hard problems in chemistry, but also explore how such algorithms can help us to better understand and improve our traditional approaches. In particular, we first introduce a new method, the variational quantum eigensolver, which is designed to maximally utilize the quantum resources available in a device to solve chemical problems. We apply this method in a real quantum photonic device in the lab to study the dissociation of the helium hydride (HeH+) molecule. We also enhance this methodology with architecture specific optimizations on ion trap computers and show how linear-scaling techniques from traditional quantum chemistry can be used to improve the outlook of similar algorithms on quantum computers. We then show how studying quantum algorithms such as these can be used to understand and enhance the development of classical algorithms. In particular we use a tool from adiabatic quantum computation, Feynman's Clock, to develop a new discrete time variational principle and further establish a connection between real-time quantum dynamics and ground state eigenvalue problems. We use these tools to develop two novel parallel-in-time quantum algorithms that outperform competitive algorithms as well as offer new insights into the connection between the fermion sign problem of ground states and the dynamical sign problem of quantum dynamics. Finally we use insights gained in the study of quantum circuits to explore a general notion of sparsity in many-body quantum systems. In particular we use
The new landscape of parallel computer architecture
NASA Astrophysics Data System (ADS)
Shalf, John
2007-07-01
The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Astrophysics Data System (ADS)
Stoughton, John W.; Mielke, Roland R.
1988-02-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
The Fermilab Central Computing Facility architectural model
Nicholls, J.
1989-05-01
The goal of the current Central Computing Upgrade at Fermilab is to create a computing environment that maximizes total productivity, particularly for high energy physics analysis. The Computing Department and the Next Computer Acquisition Committee decided upon a model which includes five components: an interactive front end, a Large-Scale Scientific Computer (LSSC, a mainframe computing engine), a microprocessor farm system, a file server, and workstations. With the exception of the file server, all segments of this model are currently in production: a VAX/VMS Cluster interactive front end, an Amdahl VM computing engine, ACP farms, and (primarily) VMS workstations. This presentation will discuss the implementation of the Fermilab Central Computing Facility Architectural Model. Implications for Code Management in such a heterogeneous environment, including issues such as modularity and centrality, will be considered. Special emphasis will be placed on connectivity and communications between the front-end, LSSC, and workstations, as practiced at Fermilab. 2 figs.
Computing on Knights and Kepler Architectures
NASA Astrophysics Data System (ADS)
Bortolotti, G.; Caberletti, M.; Crimi, G.; Ferraro, A.; Giacomini, F.; Manzali, M.; Maron, G.; Pivanti, M.; Salomoni, D.; Schifano, S. F.; Tripiccione, R.; Zanella, M.
2014-06-01
A recent trend in scientific computing is the increasingly important role of co-processors, originally built to accelerate graphics rendering, and now used for general high-performance computing. The INFN Computing On Knights and Kepler Architectures (COKA) project focuses on assessing the suitability of co-processor boards for scientific computing in a wide range of physics applications, and on studying the best programming methodologies for these systems. Here we present in a comparative way our results in porting a Lattice Boltzmann code on two state-of-the-art accelerators: the NVIDIA K20X, and the Intel Xeon-Phi. We describe our implementations, analyze results and compare with a baseline architecture adopting Intel Sandy Bridge CPUs.
Quantum perceptron over a field and neural network architecture selection in a quantum computer.
da Silva, Adenilton José; Ludermir, Teresa Bernarda; de Oliveira, Wilson Rosa
2016-04-01
In this work, we propose a quantum neural network named quantum perceptron over a field (QPF). Quantum computers are not yet a reality and the models and algorithms proposed in this work cannot be simulated in actual (or classical) computers. QPF is a direct generalization of a classical perceptron and solves some drawbacks found in previous models of quantum perceptrons. We also present a learning algorithm named Superposition based Architecture Learning algorithm (SAL) that optimizes the neural network weights and architectures. SAL searches for the best architecture in a finite set of neural network architectures with linear time over the number of patterns in the training set. SAL is the first learning algorithm to determine neural network architectures in polynomial time. This speedup is obtained by the use of quantum parallelism and a non-linear quantum operator. PMID:26878722
Evaluation of Visual Computer Simulator for Computer Architecture Education
ERIC Educational Resources Information Center
Imai, Yoshiro; Imai, Masatoshi; Moritoh, Yoshio
2013-01-01
This paper presents trial evaluation of a visual computer simulator in 2009-2011, which has been developed to play some roles of both instruction facility and learning tool simultaneously. And it illustrates an example of Computer Architecture education for University students and usage of e-Learning tool for Assembly Programming in order to…
Highly parallel computer architecture for robotic computation
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)
1991-01-01
In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Wavelet Algorithms for Illumination Computations
NASA Astrophysics Data System (ADS)
Schroder, Peter
One of the core problems of computer graphics is the computation of the equilibrium distribution of light in a scene. This distribution is given as the solution to a Fredholm integral equation of the second kind involving an integral over all surfaces in the scene. In the general case such solutions can only be numerically approximated, and are generally costly to compute, due to the geometric complexity of typical computer graphics scenes. For this computation both Monte Carlo and finite element techniques (or hybrid approaches) are typically used. A simplified version of the illumination problem is known as radiosity, which assumes that all surfaces are diffuse reflectors. For this case hierarchical techniques, first introduced by Hanrahan et al. (32), have recently gained prominence. The hierarchical approaches lead to an asymptotic improvement when only finite precision is required. The resulting algorithms have cost proportional to O(k^2 + n) versus the usual O(n^2) (k is the number of input surfaces, n the number of finite elements into which the input surfaces are meshed). Similarly a hierarchical technique has been introduced for the more general radiance problem (which allows glossy reflectors) by Aupperle et al. (6). In this dissertation we show the equivalence of these hierarchical techniques to the use of a Haar wavelet basis in a general Galerkin framework. By so doing, we come to a deeper understanding of the properties of the numerical approximations used and are able to extend the hierarchical techniques to higher orders. In particular, we show the correspondence of the geometric arguments underlying hierarchical methods to the theory of Calderon-Zygmund operators and their sparse realization in wavelet bases. The resulting wavelet algorithms for radiosity and radiance are analyzed and numerical results achieved with our implementation are reported. We find that the resulting algorithms achieve smaller and smoother errors at equivalent work.
ATCA for Machines-- Advanced Telecommunications Computing Architecture
Larsen, R.S.; /SLAC
2008-04-22
The Advanced Telecommunications Computing Architecture is a new industry open standard for electronics instrument modules and shelves being evaluated for the International Linear Collider (ILC). It is the first industrial standard designed for High Availability (HA). ILC availability simulations have shown clearly that the capabilities of ATCA are needed in order to achieve acceptable integrated luminosity. The ATCA architecture looks attractive for beam instruments and detector applications as well. This paper provides an overview of ongoing R&D including application of HA principles to power electronics systems.
Implementing a computing architecture with WISDOM
Zebrowski, J.R.
1991-01-01
Over the past two years, the Savannah River Site (SRS) work force has expanded by more than 6000 employees. This large influx of personnel, in conjunction with the limited office space, has resulted in an overcrowding problem on site. To alleviate some of the overcrowding, Westinghouse Savannah River Company (WSRC) has been in the process of leasing space from several office buildings within Aiken, SC. Brookhaven, the latest off-site office building to be leased, is the starting point for a new direction in office automation which will eventually spread throughout SRS. The computing architecture in place at Brookhaven was designed to adhere to the SRS computer architecture guidelines as published by the WSRC Computer Architecture Standards Team (CAST). At the heart of the Brookhaven implementation is a Workstation Integration System for DOS, OS/2 and Macintosh (WISDOM). The key features of the WISDOM system include: it's utilization of a Local Area Network (LAN), it's Graphical User Interface (GUI), it's cross-platform capability, it's portable user interface, and the installation program. To begin, I will give an overview of the network architecture, then discuss WISDOM in detail, mention some platform integration problems that need to be addressed and conclude with a summary of the user benefits that WISDOM provides.
Implementing a computing architecture with WISDOM
Zebrowski, J.R.
1991-12-31
Over the past two years, the Savannah River Site (SRS) work force has expanded by more than 6000 employees. This large influx of personnel, in conjunction with the limited office space, has resulted in an overcrowding problem on site. To alleviate some of the overcrowding, Westinghouse Savannah River Company (WSRC) has been in the process of leasing space from several office buildings within Aiken, SC. Brookhaven, the latest off-site office building to be leased, is the starting point for a new direction in office automation which will eventually spread throughout SRS. The computing architecture in place at Brookhaven was designed to adhere to the SRS computer architecture guidelines as published by the WSRC Computer Architecture Standards Team (CAST). At the heart of the Brookhaven implementation is a Workstation Integration System for DOS, OS/2 and Macintosh (WISDOM). The key features of the WISDOM system include: it`s utilization of a Local Area Network (LAN), it`s Graphical User Interface (GUI), it`s cross-platform capability, it`s portable user interface, and the installation program. To begin, I will give an overview of the network architecture, then discuss WISDOM in detail, mention some platform integration problems that need to be addressed and conclude with a summary of the user benefits that WISDOM provides.
Computer graphics in architecture and engineering
NASA Technical Reports Server (NTRS)
Greenberg, D. P.
1975-01-01
The present status of the application of computer graphics to the building profession or architecture and its relationship to other scientific and technical areas were discussed. It was explained that, due to the fragmented nature of architecture and building activities (in contrast to the aerospace industry), a comprehensive, economic utilization of computer graphics in this area is not practical and its true potential cannot now be realized due to the present inability of architects and structural, mechanical, and site engineers to rely on a common data base. Future emphasis will therefore have to be placed on a vertical integration of the construction process and effective use of a three-dimensional data base, rather than on waiting for any technological breakthrough in interactive computing.
DFT algorithms for bit-serial GaAs array processor architectures
NASA Technical Reports Server (NTRS)
Mcmillan, Gary B.
1988-01-01
Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Analysis of an Atom-Optical Architecture for Quantum Computation
NASA Astrophysics Data System (ADS)
Devitt, Simon J.; Stephens, Ashley M.; Munro, William J.; Nemoto, Kae
Quantum technology based on photons has emerged as one of the most promising platforms for quantum information processing, having already been used in proof-of-principle demonstrations of quantum communication and quantum computation. However, the scalability of this technology depends on the successful integration of experimentally feasible devices in an architecture that tolerates realistic errors and imperfections. Here, we analyse an atom-optical architecture for quantum computation designed to meet the requirements of scalability. The architecture is based on a modular atom-cavity device that provides an effective photon-photon interaction, allowing for the rapid, deterministic preparation of a large class of entangled states. We begin our analysis at the physical level, where we outline the experimental cavity quantum electrodynamics requirements of the basic device. Then, we describe how a scalable network of these devices can be used to prepare a three-dimensional topological cluster state, sufficient for universal fault-tolerant quantum computation. We conclude at the application level, where we estimate the system-level requirements of the architecture executing an algorithm compiled for compatibility with the topological cluster state.
Towards optoelectronic architectures for integrated neuromorphic computers
NASA Astrophysics Data System (ADS)
Martinenghi, Romain; Baylon Fuentes, Antonio; Jacquot, Maxime; Chembo, Yanne K.; Larger, Laurent
2014-03-01
We investigate theoretically and experimentally the computational properties of an optoelectronic neuromorphic processor based on a complex nonlinear dynamics. This neuromorphic approach is based on a new paradigm of or reservoir computing, which is intrinsically different from the concept of Turing machines. It essentially consists in expanding the input information to be processed into a higher dimensional phase space, through the nonlinear transient response of a complex dynamics excited by the input information. The computed output is then extracted via a linear separation of the transient trajectory in the complex phase space, performed through a learning phase consisting of the resolution of a regression problem. We here investigate an architecture for photonic neuromorphic computing via these complex nonlinear dynamical transients. A versatile photonic nonlinear transient computer based on a multiple-delay is reported. Its hybrid analogue and digital architecture allows for an easy reconfiguration, and for direct implementation of in-line processing. Its computational efficiency in parameter space is also analyzed, and the computational performance of this system is successfully evaluated on a standard spoken digit recognition task. We then discuss the pathways that can lead to its effective integration.
NASA Astrophysics Data System (ADS)
Karthik, Victor U.; Sivasuthan, Sivamayam; Hoole, Samuel Ratnajeevan H.
2014-02-01
The computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal - a particular field configuration yielding the design performance in synthesis or to match exterior measurements in NDE. The geometry of the design or the postulated interior defect is then computed. Several optimization methods are available for this. The most efficient like conjugate gradients are very complex to program for the required derivative information. The least efficient zeroth order algorithms like the genetic algorithm take much computational time but little programming effort. This paper reports launching a Genetic Algorithm kernel on thousands of compute unified device architecture (CUDA) threads exploiting the NVIDIA graphics processing unit (GPU) architecture. The efficiency of parallelization, although below that on shared memory supercomputer architectures, is quite effective in cutting down solution time into the realm of the practicable. We carry this further into multi-physics electro-heat problems where the parameters of description are in the electrical problem and the object function in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is the most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
Roadmap to the SRS computing architecture
Johnson, A.
1994-07-05
This document outlines the major steps that must be taken by the Savannah River Site (SRS) to migrate the SRS information technology (IT) environment to the new architecture described in the Savannah River Site Computing Architecture. This document proposes an IT environment that is {open_quotes}...standards-based, data-driven, and workstation-oriented, with larger systems being utilized for the delivery of needed information to users in a client-server relationship.{close_quotes} Achieving this vision will require many substantial changes in the computing applications, systems, and supporting infrastructure at the site. This document consists of a set of roadmaps which provide explanations of the necessary changes for IT at the site and describes the milestones that must be completed to finish the migration.
Efficient Universal Computing Architectures for Decoding Neural Activity
Rapoport, Benjamin I.; Turicchia, Lorenzo; Wattanapanitch, Woradorn; Davidson, Thomas J.; Sarpeshkar, Rahul
2012-01-01
The ability to decode neural activity into meaningful control signals for prosthetic devices is critical to the development of clinically useful brain– machine interfaces (BMIs). Such systems require input from tens to hundreds of brain-implanted recording electrodes in order to deliver robust and accurate performance; in serving that primary function they should also minimize power dissipation in order to avoid damaging neural tissue; and they should transmit data wirelessly in order to minimize the risk of infection associated with chronic, transcutaneous implants. Electronic architectures for brain– machine interfaces must therefore minimize size and power consumption, while maximizing the ability to compress data to be transmitted over limited-bandwidth wireless channels. Here we present a system of extremely low computational complexity, designed for real-time decoding of neural signals, and suited for highly scalable implantable systems. Our programmable architecture is an explicit implementation of a universal computing machine emulating the dynamics of a network of integrate-and-fire neurons; it requires no arithmetic operations except for counting, and decodes neural signals using only computationally inexpensive logic operations. The simplicity of this architecture does not compromise its ability to compress raw neural data by factors greater than . We describe a set of decoding algorithms based on this computational architecture, one designed to operate within an implanted system, minimizing its power consumption and data transmission bandwidth; and a complementary set of algorithms for learning, programming the decoder, and postprocessing the decoded output, designed to operate in an external, nonimplanted unit. The implementation of the implantable portion is estimated to require fewer than 5000 operations per second. A proof-of-concept, 32-channel field-programmable gate array (FPGA) implementation of this portion is consequently energy efficient
NASA Technical Reports Server (NTRS)
Mielke, R.; Stoughton, J.; Som, S.; Obando, R.; Malekpour, M.; Mandala, B.
1990-01-01
A functional description of the ATAMM Multicomputer Operating System is presented. ATAMM (Algorithm to Architecture Mapping Model) is a marked graph model which describes the implementation of large grained, decomposed algorithms on data flow architectures. AMOS, the ATAMM Multicomputer Operating System, is an operating system which implements the ATAMM rules. A first generation version of AMOS which was developed for the Advanced Development Module (ADM) is described. A second generation version of AMOS being developed for the Generic VHSIC Spaceborne Computer (GVSC) is also presented.
NASA Technical Reports Server (NTRS)
Hsia, T. C.; Lu, G. Z.; Han, W. H.
1987-01-01
In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
Adaptive DNA Computing Algorithm by Using PCR and Restriction Enzyme
NASA Astrophysics Data System (ADS)
Kon, Yuji; Yabe, Kaoru; Rajaee, Nordiana; Ono, Osamu
In this paper, we introduce an adaptive DNA computing algorithm by using polymerase chain reaction (PCR) and restriction enzyme. The adaptive algorithm is designed based on Adleman-Lipton paradigm[3] of DNA computing. In this work, however, unlike the Adleman- Lipton architecture a cutting operation has been introduced to the algorithm and the mechanism in which the molecules used by computation were feedback to the next cycle devised. Moreover, the amplification by PCR is performed in the molecule used by feedback and the difference concentration arisen in the base sequence can be used again. By this operation the molecules which serve as a solution candidate can be reduced down and the optimal solution is carried out in the shortest path problem. The validity of the proposed adaptive algorithm is considered with the logical simulation and finally we go on to propose applying adaptive algorithm to the chemical experiment which used the actual DNA molecules for solving an optimal network problem.
Acoustooptic linear algebra processors - Architectures, algorithms, and applications
NASA Technical Reports Server (NTRS)
Casasent, D.
1984-01-01
Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
Accelerating scientific computations with mixed precision algorithms
NASA Astrophysics Data System (ADS)
Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack; Kurzak, Jakub; Langou, Julie; Langou, Julien; Luszczek, Piotr; Tomov, Stanimire
2009-12-01
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summaryProgram title: ITER-REF Catalogue identifier: AECO_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 7211 No. of bytes in distributed program, including test data, etc.: 41 862 Distribution format: tar.gz Programming language: FORTRAN 77 Computer: desktop, server Operating system: Unix/Linux RAM: 512 Mbytes Classification: 4.8 External routines: BLAS (optional) Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU
Lewis, P.S.
1988-10-01
Least squares techniques are widely used in adaptive signal processing. While algorithms based on least squares are robust and offer rapid convergence properties, they also tend to be complex and computationally intensive. To enable the use of least squares techniques in real-time applications, it is necessary to develop adaptive algorithms that are efficient and numerically stable, and can be readily implemented in hardware. The first part of this work presents a uniform development of general recursive least squares (RLS) algorithms, and multichannel least squares lattice (LSL) algorithms. RLS algorithms are developed for both direct estimators, in which a desired signal is present, and for mixed estimators, in which no desired signal is available, but the signal-to-data cross-correlation is known. In the second part of this work, new and more flexible techniques of mapping algorithms to array architectures are presented. These techniques, based on the synthesis and manipulation of locally recursive algorithms (LRAs), have evolved from existing data dependence graph-based approaches, but offer the increased flexibility needed to deal with the structural complexities of the RLS and LSL algorithms. Using these techniques, various array architectures are developed for each of the RLS and LSL algorithms and the associated space/time tradeoffs presented. In the final part of this work, the application of these algorithms is demonstrated by their employment in the enhancement of single-trial auditory evoked responses in magnetoencephalography. 118 refs., 49 figs., 36 tabs.
Parallel computer graphics algorithms for the Connection Machine
Richardson, J.F.
1990-01-01
Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The connection machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hearafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.
Algorithms and architectures for robot vision
NASA Technical Reports Server (NTRS)
Schenker, Paul S.
1990-01-01
The scope of the current work is to develop practical sensing implementations for robots operating in complex, partially unstructured environments. A focus in this work is to develop object models and estimation techniques which are specific to requirements of robot locomotion, approach and avoidance, and grasp and manipulation. Such problems have to date received limited attention in either computer or human vision - in essence, asking not only how perception is in general modeled, but also what is the functional purpose of its underlying representations. As in the past, researchers are drawing on ideas from both the psychological and machine vision literature. Of particular interest is the development 3-D shape and motion estimates for complex objects when given only partial and uncertain information and when such information is incrementally accrued over time. Current studies consider the use of surface motion, contour, and texture information, with the longer range goal of developing a fused sensing strategy based on these sources and others.
Performance evaluation of the SX-6 vector architecture forscientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan Carter; Shalf,John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri,Jahed; Van der Wijngaart, Rob
2005-01-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor, and compares it against the cache-based IBMPower3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX-6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache-based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Resource utilization model for the algorithm to architecture mapping model
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Patel, Rakesh R.
1993-01-01
The analytical model for resource utilization and the variable node time and conditional node model for the enhanced ATAMM model for a real-time data flow architecture are presented in this research. The Algorithm To Architecture Mapping Model, ATAMM, is a Petri net based graph theoretic model developed at Old Dominion University, and is capable of modeling the execution of large-grained algorithms on a real-time data flow architecture. Using the resource utilization model, the resource envelope may be obtained directly from a given graph and, consequently, the maximum number of required resources may be evaluated. The node timing diagram for one iteration period may be obtained using the analytical resource envelope. The variable node time model, which describes the change in resource requirement for the execution of an algorithm under node time variation, is useful to expand the applicability of the ATAMM model to heterogeneous architectures. The model also describes a method of detecting the presence of resource limited mode and its subsequent prevention. Graphs with conditional nodes are shown to be reduced to equivalent graphs with time varying nodes and, subsequently, may be analyzed using the variable node time model to determine resource requirements. Case studies are performed on three graphs for the illustration of applicability of the analytical theories.
Job Superscheduler Architecture and Performance in Computational Grid Environments
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Oliker, Leonid; Biswas, Rupak
2003-01-01
Computational grids hold great promise in utilizing geographically separated heterogeneous resources to solve large-scale complex scientific problems. However, a number of major technical hurdles, including distributed resource management and effective job scheduling, stand in the way of realizing these gains. In this paper, we propose a novel grid superscheduler architecture and three distributed job migration algorithms. We also model the critical interaction between the superscheduler and autonomous local schedulers. Extensive performance comparisons with ideal, central, and local schemes using real workloads from leading computational centers are conducted in a simulation environment. Additionally, synthetic workloads are used to perform a detailed sensitivity analysis of our superscheduler. Several key metrics demonstrate that substantial performance gains can be achieved via smart superscheduling in distributed computational grids.
Developing a Distributed Computing Architecture at Arizona State University.
ERIC Educational Resources Information Center
Armann, Neil; And Others
1994-01-01
Development of Arizona State University's computing architecture, designed to ensure that all new distributed computing pieces will work together, is described. Aspects discussed include the business rationale, the general architectural approach, characteristics and objectives of the architecture, specific services, and impact on the university…
Algorithms and architectures for high performance analysis of semantic graphs.
Hendrickson, Bruce Alan
2005-09-01
analysis. Since intelligence datasets can be extremely large, the focus of this work is on the use of parallel computers. We have been working to develop scalable parallel algorithms that will be at the core of a semantic graph analysis infrastructure. Our work has involved two different thrusts, corresponding to two different computer architectures. The first architecture of interest is distributed memory, message passing computers. These machines are ubiquitous and affordable, but they are challenging targets for graph algorithms. Much of our distributed-memory work to date has been collaborative with researchers at Lawrence Livermore National Laboratory and has focused on finding short paths on distributed memory parallel machines. Our implementation on 32K processors of BlueGene/Light finds shortest paths between two specified vertices in just over a second for random graphs with 4 billion vertices.
Parallel algorithms for mapping pipelined and parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
NASA Astrophysics Data System (ADS)
Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu
2015-09-01
We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.
Frances: A Tool for Understanding Computer Architecture and Assembly Language
ERIC Educational Resources Information Center
Sondag, Tyler; Pokorny, Kian L.; Rajan, Hridesh
2012-01-01
Students in all areas of computing require knowledge of the computing device including software implementation at the machine level. Several courses in computer science curricula address these low-level details such as computer architecture and assembly languages. For such courses, there are advantages to studying real architectures instead of…
Chae, S.H.
1988-01-01
This dissertation describes an intelligent rule-based iterative parallel algorithm for solving randomly distributed large sparse linear systems of equations, and also the efficient parallel processing computer architecture for the implementation of the algorithm. Implemented with the Jacobi iterative method, the intelligent rule-based algorithm reduces the parallel execution time by reducing the individual inner product operation time. A static dataflow architecture is proposed for implementing the intelligent rule-based iterative parallel algorithm. The dataflow computer architecture has the capability to support parallelism exploited in the algorithm, and the execute the algorithm asynchronously. The proposed computer architecture consists of a main processor, several control processors, scalar slave processors, and pipelined slave processors. Several control processors share with the main processor the heavy burden of allocation of operation packets and of sychronization for parallel processing.
Computer architecture evaluation for structural dynamics computations: Project summary
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1989-01-01
The intent of the proposed effort is the examination of the impact of the elements of parallel architectures on the performance realized in a parallel computation. To this end, three major projects are developed: a language for the expression of high level parallelism, a statistical technique for the synthesis of multicomputer interconnection networks based upon performance prediction, and a queueing model for the analysis of shared memory hierarchies.
Verifying a Computer Algorithm Mathematically.
ERIC Educational Resources Information Center
Olson, Alton T.
1986-01-01
Presents an example of mathematics from an algorithmic point of view, with emphasis on the design and verification of this algorithm. The program involves finding roots for algebraic equations using the half-interval search algorithm. The program listing is included. (JN)
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm
Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En
2015-01-01
A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction. PMID:26287193
Hierarchical Poly Tree computer architectures defined by computational multidisciplinary mechanics
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug; Johnson, Keith
1989-01-01
This paper will develop an alternative computer architecture called the Poly Tree. Based on the requirements of computational mechanics and the concept of hierarchical substructuring, the paper will explore the development of problem-dependent parallel networks of processors which will enable significant, often superlinear, speed enhancements; provide a logical/efficient framework for linear/nonlinear and transient structural mechanics problems; and provide a logical framework from which to apply model reduction procedures. In addition, the paper will explore optimal processor arrangements which define the overall system granularity. Consideration will also be given to system I/O requirements.
A high throughput architecture for a low complexity soft-output demapping algorithm
NASA Astrophysics Data System (ADS)
Ali, I.; Wasenmüller, U.; Wehn, N.
2015-11-01
Iterative channel decoders such as Turbo-Code and LDPC decoders show exceptional performance and therefore they are a part of many wireless communication receivers nowadays. These decoders require a soft input, i.e., the logarithmic likelihood ratio (LLR) of the received bits with a typical quantization of 4 to 6 bits. For computing the LLR values from a received complex symbol, a soft demapper is employed in the receiver. The implementation cost of traditional soft-output demapping methods is relatively large in high order modulation systems, and therefore low complexity demapping algorithms are indispensable in low power receivers. In the presence of multiple wireless communication standards where each standard defines multiple modulation schemes, there is a need to have an efficient demapper architecture covering all the flexibility requirements of these standards. Another challenge associated with hardware implementation of the demapper is to achieve a very high throughput in double iterative systems, for instance, MIMO and Code-Aided Synchronization. In this paper, we present a comprehensive communication and hardware performance evaluation of low complexity soft-output demapping algorithms to select the best algorithm for implementation. The main goal of this work is to design a high throughput, flexible, and area efficient architecture. We describe architectures to execute the investigated algorithms. We implement these architectures on a FPGA device to evaluate their hardware performance. The work has resulted in a hardware architecture based on the figured out best low complexity algorithm delivering a high throughput of 166 Msymbols/second for Gray mapped 16-QAM modulation on Virtex-5. This efficient architecture occupies only 127 slice registers, 248 slice LUTs and 2 DSP48Es.
QPSO-Based Adaptive DNA Computing Algorithm
Karakose, Mehmet; Cigdem, Ugur
2013-01-01
DNA (deoxyribonucleic acid) computing that is a new computation model based on DNA molecules for information storage has been increasingly used for optimization and data analysis in recent years. However, DNA computing algorithm has some limitations in terms of convergence speed, adaptability, and effectiveness. In this paper, a new approach for improvement of DNA computing is proposed. This new approach aims to perform DNA computing algorithm with adaptive parameters towards the desired goal using quantum-behaved particle swarm optimization (QPSO). Some contributions provided by the proposed QPSO based on adaptive DNA computing algorithm are as follows: (1) parameters of population size, crossover rate, maximum number of operations, enzyme and virus mutation rate, and fitness function of DNA computing algorithm are simultaneously tuned for adaptive process, (2) adaptive algorithm is performed using QPSO algorithm for goal-driven progress, faster operation, and flexibility in data, and (3) numerical realization of DNA computing algorithm with proposed approach is implemented in system identification. Two experiments with different systems were carried out to evaluate the performance of the proposed approach with comparative results. Experimental results obtained with Matlab and FPGA demonstrate ability to provide effective optimization, considerable convergence speed, and high accuracy according to DNA computing algorithm. PMID:23935409
A Biomimetic Adaptive Algorithm and Low-Power Architecture for Implantable Neural Decoders
Rapoport, Benjamin I.; Wattanapanitch, Woradorn; Penagos, Hector L.; Musallam, Sam; Andersen, Richard A.; Sarpeshkar, Rahul
2010-01-01
Algorithmically and energetically efficient computational architectures that operate in real time are essential for clinically useful neural prosthetic devices. Such devices decode raw neural data to obtain direct control signals for external devices. They can also perform data compression and vastly reduce the bandwidth and consequently power expended in wireless transmission of raw data from implantable brain-machine interfaces. We describe a biomimetic algorithm and micropower analog circuit architecture for decoding neural cell ensemble signals. The decoding algorithm implements a continuous-time artificial neural network, using a bank of adaptive linear filters with kernels that emulate synaptic dynamics. The filters transform neural signal inputs into control-parameter outputs, and can be tuned automatically in an on-line learning process. We provide experimental validation of our system using neural data from thalamic head-direction cells in an awake behaving rat. PMID:19964345
Algorithm architecture co-design for ultra low-power image sensor
NASA Astrophysics Data System (ADS)
Laforest, T.; Dupret, A.; Verdant, A.; Lattard, D.; Villard, P.
2012-03-01
In a context of embedded video surveillance, stand alone leftbehind image sensors are used to detect events with high level of confidence, but also with a very low power consumption. Using a steady camera, motion detection algorithms based on background estimation to find regions in movement are simple to implement and computationally efficient. To reduce power consumption, the background is estimated using a down sampled image formed of macropixels. In order to extend the class of moving objects to be detected, we propose an original mixed mode architecture developed thanks to an algorithm architecture co-design methodology. This programmable architecture is composed of a vector of SIMD processors. A basic RISC architecture was optimized in order to implement motion detection algorithms with a dedicated set of 42 instructions. Definition of delta modulation as a calculation primitive has allowed to implement algorithms in a very compact way. Thereby, a 1920x1080@25fps CMOS image sensor performing integrated motion detection is proposed with a power estimation of 1.8 mW.
A High Performance COTS Based Computer Architecture
NASA Astrophysics Data System (ADS)
Patte, Mathieu; Grimoldi, Raoul; Trautner, Roland
2014-08-01
Using Commercial Off The Shelf (COTS) electronic components for space applications is a long standing idea. Indeed the difference in processing performance and energy efficiency between radiation hardened components and COTS components is so important that COTS components are very attractive for use in mass and power constrained systems. However using COTS components in space is not straightforward as one must account with the effects of the space environment on the COTS components behavior. In the frame of the ESA funded activity called High Performance COTS Based Computer, Airbus Defense and Space and its subcontractor OHB CGS have developed and prototyped a versatile COTS based architecture for high performance processing. The rest of the paper is organized as follows: in a first section we will start by recapitulating the interests and constraints of using COTS components for space applications; then we will briefly describe existing fault mitigation architectures and present our solution for fault mitigation based on a component called the SmartIO; in the last part of the paper we will describe the prototyping activities executed during the HiP CBC project.
Algorithmic Mechanism Design of Evolutionary Computation
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
Algorithmic Mechanism Design of Evolutionary Computation.
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
A flexible algorithm for calculating pair interactions on SIMD architectures
NASA Astrophysics Data System (ADS)
Páll, Szilárd; Hess, Berk
2013-12-01
Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
NASA Technical Reports Server (NTRS)
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
From Point Clouds to Architectural Models: Algorithms for Shape Reconstruction
NASA Astrophysics Data System (ADS)
Canciani, M.; Falcolini, C.; Saccone, M.; Spadafora, G.
2013-02-01
The use of terrestrial laser scanners in architectural survey applications has become more and more common. Row data complexity, as given by scanner restitution, leads to several problems about design and 3D-modelling starting from Point Clouds. In this context we present a study on architectural sections and mathematical algorithms for their shape reconstruction, according to known or definite geometrical rules, focusing on shapes of different complexity. Each step of the semi-automatic algorithm has been developed using Mathematica software and CAD, integrating both programs in order to reconstruct a geometrical CAD model of the object. Our study is motivated by the fact that, for architectural survey, most of three dimensional modelling procedures concerning point clouds produce superabundant, but often unnecessary, information and are also very expensive in terms of cpu time using more and more sophisticated hardware and software. On the contrary, it's important to simplify/decimate the point cloud in order to recognize a particular form out of some definite geometric/architectonic shapes. Such a process consists of several steps: first the definition of plane sections and characterization of their architecture; secondly the construction of a continuous plane curve depending on some parameters. In the third step we allow the selection on the curve of some nodal points with given specific characteristics (symmetry, tangency conditions, shadowing exclusion, corners, … ). The fourth and last step is the construction of a best shape defined by the comparison with an abacus of known geometrical elements, such as moulding profiles, leading to a precise architectonical section. The algorithms have been developed and tested in very different situations and are presented in a case study of complex geometries such as some mouldings profiles in the Church of San Carlo alle Quattro Fontane.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Architecture and applications of the HEP multiprocessor computer system
Smith, B.J.
1981-01-01
The HEP computer system is a large scale scientific parallel computer employing shared-resource MIMD architecture. The hardware and software facilities provided by the system are described, and techniques found useful in programming the system are discussed. 3 references.
Mutual Algorithm-Architecture Analysis for Real - Parallel Systems in Particle Physics Experiments.
NASA Astrophysics Data System (ADS)
Ni, Ping
Data acquisition from particle colliders requires real-time detection of tracks and energy clusters from collision events occurring at intervals of tens of mus. Beginning with the specification of a benchmark track-finding algorithm, parallel implementations have been developed. A revision of the routing scheme for performing reductions such as a tree sum, called the reduced routing distance scheme, has been developed and analyzed. The scheme reduces inter-PE communication time for narrow communication channel systems. A new parallel algorithm, called the interleaved tree sum, for parallel reduction problems has been developed that increases efficiency of processor use. Detailed analysis of this algorithm with different routing schemes is presented. Comparable parallel algorithms are analyzed, also taking into account the architectural parameters that play an important role in this parallel algorithm analysis. Computation and communication times are analyzed to guide the design of a custom system based on a massively parallel processing component. Developing an optimal system requires mutual analysis of algorithm and architecture parameters. It is shown that matching a processor array size to the parallelism of the problem does not always produce the best system design. Based on promising benchmark simulation results, an application specific hardware prototype board, called Dasher, has been built using two Blitzen chips. The processing array is a mesh-connected SIMD system with 256 PEs. Its design is discussed, with details on the software environment.
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
In part 1 architecture of NETRA is presented. A performance evaluation of NETRA using several common vision algorithms is also presented. Performance of algorithms when they are mapped on one cluster is described. It is shown that SIMD, MIMD, and systolic algorithms can be easily mapped onto processor clusters, and almost linear speedups are possible. For some algorithms, analytical performance results are compared with implementation performance results. It is observed that the analysis is very accurate. Performance analysis of parallel algorithms when mapped across clusters is presented. Mappings across clusters illustrate the importance and use of shared as well as distributed memory in achieving high performance. The parameters for evaluation are derived from the characteristics of the parallel algorithms, and these parameters are used to evaluate the alternative communication strategies in NETRA. Furthermore, the effect of communication interference from other processors in the system on the execution of an algorithm is studied. Using the analysis, performance of many algorithms with different characteristics is presented. It is observed that if communication speeds are matched with the computation speeds, good speedups are possible when algorithms are mapped across clusters.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical
Computing Algorithms for Nuffield Advanced Physics.
ERIC Educational Resources Information Center
Summers, M. K.
1978-01-01
Defines all recurrence relations used in the Nuffield course, to solve first- and second-order differential equations, and describes a typical algorithm for computer generation of solutions. (Author/GA)
Machine Computation; An Algorithmic Approach.
ERIC Educational Resources Information Center
Gonzalez, Richard F.; McMillan, Claude, Jr.
Designed for undergraduate social science students, this textbook concentrates on using the computer in a straightforward way to manipulate numbers and variables in order to solve problems. The text is problem oriented and assumes that the student has had little prior experience with either a computer or programing languages. An introduction to…
NASA Astrophysics Data System (ADS)
Lee, Seungbeom; Lee, Hanho
This paper presents a novel high-speed low-complexity pipelined degree-computationless modified Euclidean (pDCME) algorithm architecture for high-speed RS decoders. The pDCME algorithm allows elimination of the degree-computation so as to reduce hardware complexity and obtain high-speed processing. A high-speed RS decoder based on the pDCME algorithm has been designed and implemented with 0.13-μm CMOS standard cell technology in a supply voltage of 1.1V. The proposed RS decoder operates at a clock frequency of 660MHz and has a throughput of 5.3Gb/s. The proposed architecture requires approximately 15% fewer gate counts and a simpler control logic than architectures based on the popular modified Euclidean algorithm.
Modern hardware architectures accelerate porous media flow computations
NASA Astrophysics Data System (ADS)
Kulczewski, Michal; Kurowski, Krzysztof; Kierzynka, Michal; Dohnalik, Marek; Kaczmarczyk, Jan; Borujeni, Ali Takbiri
2012-05-01
Investigation of rock properties, porosity and permeability particularly, which determines transport media characteristic, is crucial to reservoir engineering. Nowadays, micro-tomography (micro-CT) methods allow to obtain vast of petro-physical properties. The micro-CT method facilitates visualization of pores structures and acquisition of total porosity factor, determined by sticking together 2D slices of scanned rock and applying proper absorption cut-off point. Proper segmentation of pores representation in 3D is important to solve the permeability of porous media. This factor is recently determined by the means of Computational Fluid Dynamics (CFD), a popular method to analyze problems related to fluid flows, taking advantage of numerical methods and constantly growing computing powers. The recent advent of novel multi-, many-core and graphics processing unit (GPU) hardware architectures allows scientists to benefit even more from parallel processing and built-in new features. The high level of parallel scalability offers both, the time-to-solution decrease and greater accuracy - top factors in reservoir engineering. This paper aims to present research results related to fluid flow simulations, particularly solving the total porosity and permeability of porous media, taking advantage of modern hardware architectures. In our approach total porosity is calculated by the means of general-purpose computing on multiple GPUs. This application sticks together 2D slices of scanned rock and by the means of a marching tetrahedra algorithm, creates a 3D representation of pores and calculates the total porosity. Experimental results are compared with data obtained via other popular methods, including Nuclear Magnetic Resonance (NMR), helium porosity and nitrogen permeability tests. Then CFD simulations are performed on a large-scale high performance hardware architecture to solve the flow and permeability of porous media. In our experiments we used Lattice Boltzmann
Algorithms for computing the multivariable stability margin
NASA Technical Reports Server (NTRS)
Tekawy, Jonathan A.; Safonov, Michael G.; Chiang, Richard Y.
1989-01-01
Stability margin for multiloop flight control systems has become a critical issue, especially in highly maneuverable aircraft designs where there are inherent strong cross-couplings between the various feedback control loops. To cope with this issue, we have developed computer algorithms based on non-differentiable optimization theory. These algorithms have been developed for computing the Multivariable Stability Margin (MSM). The MSM of a dynamical system is the size of the smallest structured perturbation in component dynamics that will destabilize the system. These algorithms have been coded and appear to be reliable. As illustrated by examples, they provide the basis for evaluating the robustness and performance of flight control systems.
Performance Analysis of Cloud Computing Architectures Using Discrete Event Simulation
NASA Technical Reports Server (NTRS)
Stocker, John C.; Golomb, Andrew M.
2011-01-01
Cloud computing offers the economic benefit of on-demand resource allocation to meet changing enterprise computing needs. However, the flexibility of cloud computing is disadvantaged when compared to traditional hosting in providing predictable application and service performance. Cloud computing relies on resource scheduling in a virtualized network-centric server environment, which makes static performance analysis infeasible. We developed a discrete event simulation model to evaluate the overall effectiveness of organizations in executing their workflow in traditional and cloud computing architectures. The two part model framework characterizes both the demand using a probability distribution for each type of service request as well as enterprise computing resource constraints. Our simulations provide quantitative analysis to design and provision computing architectures that maximize overall mission effectiveness. We share our analysis of key resource constraints in cloud computing architectures and findings on the appropriateness of cloud computing in various applications.
An events based algorithm for distributing concurrent tasks on multi-core architectures
NASA Astrophysics Data System (ADS)
Holmes, David W.; Williams, John R.; Tilke, Peter
2010-02-01
In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism including message passing and scatter-gather can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a "hungry" pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near linear speed-up with results for efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware.
Cryogenic Control Architecture for Large-Scale Quantum Computing
NASA Astrophysics Data System (ADS)
Hornibrook, J. M.; Colless, J. I.; Conway Lamb, I. D.; Pauka, S. J.; Lu, H.; Gossard, A. C.; Watson, J. D.; Gardner, G. C.; Fallahi, S.; Manfra, M. J.; Reilly, D. J.
2015-02-01
Solid-state qubits have recently advanced to the level that enables them, in principle, to be scaled up into fault-tolerant quantum computers. As these physical qubits continue to advance, meeting the challenge of realizing a quantum machine will also require the development of new supporting devices and control architectures with complexity far beyond the systems used in today's few-qubit experiments. Here, we report a microarchitecture for controlling and reading out qubits during the execution of a quantum algorithm such as an error-correcting code. We demonstrate the basic principles of this architecture using a cryogenic switch matrix implemented via high-electron-mobility transistors and a new kind of semiconductor device based on gate-switchable capacitance. The switch matrix is used to route microwave waveforms to qubits under the control of a field-programmable gate array, also operating at cryogenic temperatures. Taken together, these results suggest a viable approach for controlling large-scale quantum systems using semiconductor technology.
An improved distance matrix computation algorithm for multicore clusters.
Al-Neama, Mohammed W; Reda, Naglaa M; Ghaleb, Fayed F M
2014-01-01
Distance matrix has diverse usage in different research areas. Its computation is typically an essential task in most bioinformatics applications, especially in multiple sequence alignment. The gigantic explosion of biological sequence databases leads to an urgent need for accelerating these computations. DistVect algorithm was introduced in the paper of Al-Neama et al. (in press) to present a recent approach for vectorizing distance matrix computing. It showed an efficient performance in both sequential and parallel computing. However, the multicore cluster systems, which are available now, with their scalability and performance/cost ratio, meet the need for more powerful and efficient performance. This paper proposes DistVect1 as highly efficient parallel vectorized algorithm with high performance for computing distance matrix, addressed to multicore clusters. It reformulates DistVect1 vectorized algorithm in terms of clusters primitives. It deduces an efficient approach of partitioning and scheduling computations, convenient to this type of architecture. Implementations employ potential of both MPI and OpenMP libraries. Experimental results show that the proposed method performs improvement of around 3-fold speedup upon SSE2. Further it also achieves speedups more than 9 orders of magnitude compared to the publicly available parallel implementation utilized in ClustalW-MPI. PMID:25013779
A digital architecture for support vector machines: theory, algorithm, and FPGA implementation.
Anguita, D; Boni, A; Ridella, S
2003-01-01
In this paper, we propose a digital architecture for support vector machine (SVM) learning and discuss its implementation on a field programmable gate array (FPGA). We analyze briefly the quantization effects on the performance of the SVM in classification problems to show its robustness, in the feedforward phase, respect to fixed-point math implementations; then, we address the problem of SVM learning. The architecture described here makes use of a new algorithm for SVM learning which is less sensitive to quantization errors respect to the solution appeared so far in the literature. The algorithm is composed of two parts: the first one exploits a recurrent network for finding the parameters of the SVM; the second one uses a bisection process for computing the threshold. The architecture implementing the algorithm is described in detail and mapped on a real current-generation FPGA (Xilinx Virtex II). Its effectiveness is then tested on a channel equalization problem, where real-time performances are of paramount importance. PMID:18244555
Multithreaded processor architecture for parallel symbolic computation. Technical report
Fujita, T.
1987-09-01
This paper describes the Multilisp Architecture for Symbolic Applications (MASA), which is a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution. The principal mechanisms exploited for this processor are multiple contexts, interleaved pipeline execution from separate instruction streams, and synchronization based on a bit in each memory cell. The tagged architecture approach is taken for Lisp program execution, and trap conditions are provided for future object manipulation and garbage collection.
Pipeline and parallel architectures for computer communication systems
Reddi, A.V.
1983-01-01
Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k2 − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
Resource efficient hardware architecture for fast computation of running max/min filters.
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k(2) - 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
A random search algorithm for laboratory computers
NASA Technical Reports Server (NTRS)
Curry, R. E.
1975-01-01
The small laboratory computer is ideal for experimental control and data acquisition. Postexperimental data processing is often performed on large computers because of the availability of sophisticated programs, but costs and data compatibility are negative factors. Parameter optimization can be accomplished on the small computer, offering ease of programming, data compatibility, and low cost. A previously proposed random-search algorithm ('random creep') was found to be very slow in convergence. A method is proposed (the 'random leap' algorithm) which starts in a global search mode and automatically adjusts step size to speed convergence. A FORTRAN executive program for the random-leap algorithm is presented which calls a user-supplied function subroutine. An example of a function subroutine is given which calculates maximum-likelihood estimates of receiver operating-characteristic parameters from binary response data. Other applications in parameter estimation, generalized least squares, and matrix inversion are discussed.
A new hardware-efficient algorithm and reconfigurable architecture for image contrast enhancement.
Huang, Shih-Chia; Chen, Wen-Chieh
2014-10-01
Contrast enhancement is crucial when generating high quality images for image processing applications, such as digital image or video photography, liquid crystal display processing, and medical image analysis. In order to achieve real-time performance for high-definition video applications, it is necessary to design efficient contrast enhancement hardware architecture to meet the needs of real-time processing. In this paper, we propose a novel hardware-oriented contrast enhancement algorithm which can be implemented effectively for hardware design. In order to be considered for hardware implementation, approximation techniques are proposed to reduce these complex computations during performance of the contrast enhancement algorithm. The proposed hardware-oriented contrast enhancement algorithm achieves good image quality by measuring the results of qualitative and quantitative analyzes. To decrease hardware cost and improve hardware utilization for real-time performance, a reduction in circuit area is proposed through use of parameter-controlled reconfigurable architecture. The experiment results show that the proposed hardware-oriented contrast enhancement algorithm can provide an average frame rate of 48.23 frames/s at high definition resolution 1920 × 1080. PMID:25148665
Parallel algorithms for computing linked list prefix
Han, Y. )
1989-06-01
Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.
Integrated computer control system architectural overview
Van Arsdall, P.
1997-06-18
This overview introduces the NIF Integrated Control System (ICCS) architecture. The design is abstract to allow the construction of many similar applications from a common framework. This summary lays the essential foundation for understanding the model-based engineering approach used to execute the design.
Implementation of the FDK algorithm for cone-beam CT on the cell broadband engine architecture
NASA Astrophysics Data System (ADS)
Scherl, Holger; Koerner, Mario; Hofmann, Hannes; Eckert, Wieland; Kowarschik, Markus; Hornegger, Joachim
2007-03-01
In most of today's commercially available cone-beam CT scanners, the well known FDK method is used for solving the 3D reconstruction task. The computational complexity of this algorithm prohibits its use for many medical applications without hardware acceleration. The brand-new Cell Broadband Engine Architecture (CBEA) with its high level of parallelism is a cost-efficient processor for performing the FDK reconstruction according to the medical requirements. The programming scheme, however, is quite different to any standard personal computer hardware. In this paper, we present an innovative implementation of the most time-consuming parts of the FDK algorithm: filtering and back-projection. We also explain the required transformations to parallelize the algorithm for the CBEA. Our software framework allows to compute the filtering and back-projection in parallel, making it possible to do an on-the-fly-reconstruction. The achieved results demonstrate that a complete FDK reconstruction is computed with the CBEA in less than seven seconds for a standard clinical scenario. Given the fact that scan times are usually much higher, we conclude that reconstruction is finished right after the end of data acquisition. This enables us to present the reconstructed volume to the physician in real-time, immediately after the last projection image has been acquired by the scanning device.
A computational architecture for social agents
Bond, A.H.
1996-12-31
This article describes a new class of information-processing models for social agents. They axe derived from primate brain architecture, the processing in brain regions, the interactions among brain regions, and the social behavior of primates. In another paper, we have reviewed the neuroanatomical connections and functional involvements of cortical regions. We reviewed the evidence for a hierarchical architecture in the primate brain. By examining neuroanatomical evidence for connections among neural areas, we were able to establish anatomical regions and connections. We then examined evidence for specific functional involvements of the different neural axeas and found some support for hierarchical functioning, not only for the perception hierarchies but also for the planning and action hierarchy in the frontal lobes.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
A Simple Physical Optics Algorithm Perfect for Parallel Computing
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1993-01-01
One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Computational algorithms for simulations in atmospheric optics.
Konyaev, P A; Lukin, V P
2016-04-20
A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors. PMID:27140113
Middleware in Modern High Performance Computing System Architectures
Engelmann, Christian; Ong, Hong Hoe; Scott, Stephen L
2007-01-01
A recent trend in modern high performance computing (HPC) system architectures employs ''lean'' compute nodes running a lightweight operating system (OS). Certain parts of the OS as well as other system software services are moved to service nodes in order to increase performance and scalability. This paper examines the impact of this HPC system architecture trend on HPC ''middleware'' software solutions, which traditionally equip HPC systems with advanced features, such as parallel and distributed programming models, appropriate system resource management mechanisms, remote application steering and user interaction techniques. Since the approach of keeping the compute node software stack small and simple is orthogonal to the middleware concept of adding missing OS features between OS and application, the role and architecture of middleware in modern HPC systems needs to be revisited. The result is a paradigm shift in HPC middleware design, where single middleware services are moved to service nodes, while runtime environments (RTEs) continue to reside on compute nodes.
Computer Architecture. (Latest Citations from the Aerospace Database)
NASA Technical Reports Server (NTRS)
1996-01-01
The bibliography contains citations concerning research and development in the field of computer architecture. Design of computer systems, microcomputer components, and digital networks are among the topics discussed. Multimicroprocessor system performance, software development, and aerospace avionics applications are also included. (Contains 50-250 citations and includes a subject term index and title list.)
The Contribution of Visualization to Learning Computer Architecture
ERIC Educational Resources Information Center
Yehezkel, Cecile; Ben-Ari, Mordechai; Dreyfus, Tommy
2007-01-01
This paper describes a visualization environment and associated learning activities designed to improve learning of computer architecture. The environment, EasyCPU, displays a model of the components of a computer and the dynamic processes involved in program execution. We present the results of a research program that analysed the contribution of…
Architecture and applications of the HEP multiprocessor computer system
Smith, B.J.; Fink, D.J.
1982-01-01
The HEP computer system is a large scale scientific parallel computer employing shared resource MIMD architecture. The hardware and software facilities provided by the system are described, and techniques found to be useful in programming the system are also discussed. 3 references.
Fault tolerant hypercube computer system architecture
NASA Technical Reports Server (NTRS)
Madan, Herb S. (Inventor); Chow, Edward (Inventor)
1989-01-01
A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node
Algorithm-dependent fault tolerance for distributed computing
P. D. Hough; M. e. Goldsby; E. J. Walsh
2000-02-01
Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.
Architectures of Kepler Planet Systems with Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Morehead, Robert C.; Ford, Eric B.
2015-12-01
The distribution of period normalized transit duration ratios among Kepler’s multiple transiting planet systems constrains the distributions of mutual orbital inclinations and orbital eccentricities. However, degeneracies in these parameters tied to the underlying number of planets in these systems complicate their interpretation. To untangle the true architecture of planet systems, the mutual inclination, eccentricity, and underlying planet number distributions must be considered simultaneously. The complexities of target selection, transit probability, detection biases, vetting, and follow-up observations make it impractical to write an explicit likelihood function. Approximate Bayesian computation (ABC) offers an intriguing path forward. In its simplest form, ABC generates a sample of trial population parameters from a prior distribution to produce synthetic datasets via a physically-motivated forward model. Samples are then accepted or rejected based on how close they come to reproducing the actual observed dataset to some tolerance. The accepted samples form a robust and useful approximation of the true posterior distribution of the underlying population parameters. We build on the considerable progress from the field of statistics to develop sequential algorithms for performing ABC in an efficient and flexible manner. We demonstrate the utility of ABC in exoplanet populations and present new constraints on the distributions of mutual orbital inclinations, eccentricities, and the relative number of short-period planets per star. We conclude with a discussion of the implications for other planet occurrence rate calculations, such as eta-Earth.
Architecture independent environment for developing engineering software on MIMD computers
NASA Technical Reports Server (NTRS)
Valimohamed, Karim A.; Lopez, L. A.
1990-01-01
Engineers are constantly faced with solving problems of increasing complexity and detail. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. First, the issues that must be considered to develop those tools are examined. The two main areas of concern were architecture independence and data management. Architecture independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the investment of time and effort that goes into developing the software. The management of data is a crucial aspect of solving large engineering problems. It must be considered in light of the new hardware organizations that are available. Second, the functional design and implementation of a software environment that facilitates developing architecture independent software for large engineering applications are described. The topics of discussion include: a description of the model that supports the development of architecture independent software; identifying and exploiting concurrency within the application program; data coherence; engineering data base and memory management.
Thrifty: An Exascale Architecture for Energy Proportional Computing
Torrellas, Josep
2014-12-23
The objective of this project is to design key aspects of an exascale architecture called Thrifty that addresses the challenges of power/energy efficiency, resiliency, and performance in exascale systems. The project includes work on computer architecture (Josep Torrellas from University of Illinois), compilation (Daniel Quinlan from Lawrence Livermore National Laboratory), runtime and applications (Laura Carrington from University of California San Diego), and circuits (Wilfred Pinfold from Intel Corporation).
Heavy Lift Vehicle (HLV) Avionics Flight Computing Architecture Study
NASA Technical Reports Server (NTRS)
Hodson, Robert F.; Chen, Yuan; Morgan, Dwayne R.; Butler, A. Marc; Sdhuh, Joseph M.; Petelle, Jennifer K.; Gwaltney, David A.; Coe, Lisa D.; Koelbl, Terry G.; Nguyen, Hai D.
2011-01-01
A NASA multi-Center study team was assembled from LaRC, MSFC, KSC, JSC and WFF to examine potential flight computing architectures for a Heavy Lift Vehicle (HLV) to better understand avionics drivers. The study examined Design Reference Missions (DRMs) and vehicle requirements that could impact the vehicles avionics. The study considered multiple self-checking and voting architectural variants and examined reliability, fault-tolerance, mass, power, and redundancy management impacts. Furthermore, a goal of the study was to develop the skills and tools needed to rapidly assess additional architectures should requirements or assumptions change.
Iterative algorithms for large sparse linear systems on parallel computers
NASA Technical Reports Server (NTRS)
Adams, L. M.
1982-01-01
Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Aerodynamic optimization studies on advanced architecture computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana
1995-01-01
The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises of choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accomodate multi-discipline optimization in future computations. In the work , however, only single discipline aerodynamic optimization will be included.
Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures
Datta, Kaushik; Murphy, Mark; Volkov, Vasily; Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Patterson, David; Shalf, John; Yelick, Katherine
2008-08-22
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations -- a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.
Fast computation algorithms for speckle pattern simulation
Nascov, Victor; Samoilă, Cornel; Ursuţiu, Doru
2013-11-13
We present our development of a series of efficient computation algorithms, generally usable to calculate light diffraction and particularly for speckle pattern simulation. We use mainly the scalar diffraction theory in the form of Rayleigh-Sommerfeld diffraction formula and its Fresnel approximation. Our algorithms are based on a special form of the convolution theorem and the Fast Fourier Transform. They are able to evaluate the diffraction formula much faster than by direct computation and we have circumvented the restrictions regarding the relative sizes of the input and output domains, met on commonly used procedures. Moreover, the input and output planes can be tilted each to other and the output domain can be off-axis shifted.
Algorithms for parallel flow solvers on message passing architectures
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
LINCS: Livermore's network architecture. [Octopus computing network
Fletcher, J.G.
1982-01-01
Octopus, a local computing network that has been evolving at the Lawrence Livermore National Laboratory for over fifteen years, is currently undergoing a major revision. The primary purpose of the revision is to consolidate and redefine the variety of conventions and formats, which have grown up over the years, into a single standard family of protocols, the Livermore Interactive Network Communication Standard (LINCS). This standard treats the entire network as a single distributed operating system such that access to a computing resource is obtained in a single way, whether that resource is local (on the same computer as the accessing process) or remote (on another computer). LINCS encompasses not only communication but also such issues as the relationship of customer to server processes and the structure, naming, and protection of resources. The discussion includes: an overview of the Livermore user community and computing hardware, the functions and structure of each of the seven layers of LINCS protocol, the reasons why we have designed our own protocols and why we are dissatisfied by the directions that current protocol standards are taking.
Computational architecture for integrated controls and structures design
NASA Technical Reports Server (NTRS)
Belvin, W. Keith; Park, K. C.
1989-01-01
To facilitate the development of control structure interaction (CSI) design methodology, a computational architecture for interdisciplinary design of active structures is presented. The emphasis of the computational procedure is to exploit existing sparse matrix structural analysis techniques, in-core data transfer with control synthesis programs, and versatility in the optimization methodology to avoid unnecessary structural or control calculations. The architecture is designed such that all required structure, control and optimization analyses are performed within one program. Hence, the optimization strategy is not unduly constrained by cold starts of existing structural analysis and control synthesis packages.
Layered Architectures for Quantum Computers and Quantum Repeaters
NASA Astrophysics Data System (ADS)
Jones, Nathan C.
This chapter examines how to organize quantum computers and repeaters using a systematic framework known as layered architecture, where machine control is organized in layers associated with specialized tasks. The framework is flexible and could be used for analysis and comparison of quantum information systems. To demonstrate the design principles in practice, we develop architectures for quantum computers and quantum repeaters based on optically controlled quantum dots, showing how a myriad of technologies must operate synchronously to achieve fault-tolerance. Optical control makes information processing in this system very fast, scalable to large problem sizes, and extendable to quantum communication.
Panel on future directions in parallel computer architecture
VanTilborg, A.M. )
1989-06-01
One of the program highlights of the 15th Annual International Symposium on Computer Architecture, held May 30 - June 2, 1988 in Honolulu, was a panel session on future directions in parallel computer architecture. The panel was organized and chaired by the author, and was comprised of Prof. Jack Dennis (NASA Ames Research Institute for Advanced Computer Science), Prof. H.T. Kung (Carnegie Mellon), and Dr. Burton Smith (Tera Computer Company). The objective of the panel was to identify the likely trajectory of future parallel computer system progress, particularly from the sandpoint of marketplace acceptance. Approximately 250 attendees participated in the session, in which each panelist began with a ten minute viewgraph explanation of his views, followed by an open and sometimes lively exchange with the audience and fellow panelists. The session ran for ninety minutes.
Architecture and grid application of cluster computing system
NASA Astrophysics Data System (ADS)
Lv, Yi; Yu, Shuiqin; Mao, Youju
2004-11-01
Recently, people pay more attention to the grid technology. It can not only connect all kinds of resources in the network, but also put them into a super transparent computing environment for customers to realize mete-computing which can share computing resources. Traditional parallel computing system, such as SMP(Symmetrical multiprocessor) and MPP(massively parallel processor), use multi-processors to raise computing speed in a close coupling way, so the flexible and scalable performance of the system are limited, as a result of it, the system can't meet the requirement of the grid technology. In this paper, the architecture of cluster computing system applied in grid nodes is introduced. It mainly includes the following aspects. First, the network architecture of cluster computing system in grid nodes is analyzed and designed. Second, how to realize distributing computing (including coordinating computing and sharing computing) of cluster computing system in grid nodes to construct virtual node computers is discussed. Last, communication among grid nodes is analyzed. In other words, it discusses how to realize single reflection to let all the service requirements from customers be met through sending to the grid nodes.
Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable
Schuller, Ivan K.; Stevens, Rick; Pino, Robinson; Pechan, Michael
2015-10-29
Computation in its many forms is the engine that fuels our modern civilization. Modern computation—based on the von Neumann architecture—has allowed, until now, the development of continuous improvements, as predicted by Moore’s law. However, computation using current architectures and materials will inevitably—within the next 10 years—reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like (“neuromorphic”) computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS based technology? If so, what are the basic research challenges for materials sicence and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully “neuromorphic” computer. To address this challenge, the following issues were considered: The main differences between neuromorphic and conventional computing as related to: signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning New neuromorphic architectures needed to: produce lower energy consumption, potential novel nanostructured materials, and enhanced computation Device and materials properties needed to implement functions such as: hysteresis, stability, and fault tolerance Comparisons of different implementations: spin torque, memristors, resistive switching, phase change, and optical schemes for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.
NASA Astrophysics Data System (ADS)
Trigueros-Espinosa, Blas; Vélez-Reyes, Miguel; Santiago-Santiago, Nayda G.; Rosario-Torres, Samuel
2011-06-01
Hyperspectral sensors can collect hundreds of images taken at different narrow and contiguously spaced spectral bands. This high-resolution spectral information can be used to identify materials and objects within the field of view of the sensor by their spectral signature, but this process may be computationally intensive due to the large data sizes generated by the hyperspectral sensors, typically hundreds of megabytes. This can be an important limitation for some applications where the detection process must be performed in real time (surveillance, explosive detection, etc.). In this work, we developed a parallel implementation of three state-ofthe- art target detection algorithms (RX algorithm, matched filter and adaptive matched subspace detector) using a graphics processing unit (GPU) based on the NVIDIA® CUDA™ architecture. In addition, a multi-core CPUbased implementation of each algorithm was developed to be used as a baseline for the speedups estimation. We evaluated the performance of the GPU-based implementations using an NVIDIA ® Tesla® C1060 GPU card, and the detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We achieved a maximum speedup in the GPU implementations of around 20x over a multicore CPU-based implementation, which suggests that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.
A single user efficiency measure for evaluation of parallel or pipeline computer architectures
NASA Technical Reports Server (NTRS)
Jones, W. P.
1978-01-01
A precise statement of the relationship between sequential computation at one rate, parallel or pipeline computation at a much higher rate, the data movement rate between levels of memory, the fraction of inherently sequential operations or data that must be processed sequentially, the fraction of data to be moved that cannot be overlapped with computation, and the relative computational complexity of the algorithms for the two processes, scalar and vector, was developed. The relationship should be applied to the multirate processes that obtain in the employment of various new or proposed computer architectures for computational aerodynamics. The relationship, an efficiency measure that the single user of the computer system perceives, argues strongly in favor of separating scalar and vector processes, sometimes referred to as loosely coupled processes, to achieve optimum use of hardware.
Parallel algorithms for optical digital computers
Huang, A.
1983-01-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.
A computationally efficient particle-simulation method suited to vector-computer architectures
McDonald, J.D.
1990-01-01
Recent interest in a National Aero-Space Plane (NASP) and various Aero-assisted Space Transfer Vehicles (ASTVs) presents the need for a greater understanding of high-speed rarefied flight conditions. Particle simulation techniques such as the Direct Simulation Monte Carlo (DSMC) method are well suited to such problems, but the high cost of computation limits the application of the methods to two-dimensional or very simple three-dimensional problems. This research re-examines the algorithmic structure of existing particle simulation methods and re-structures them to allow efficient implementation on vector-oriented supercomputers. A brief overview of the DSMC method and the Cray-2 vector computer architecture are provided, and the elements of the DSMC method that inhibit substantial vectorization are identified. One such element is the collision selection algorithm. A complete reformulation of underlying kinetic theory shows that this may be efficiently vectorized for general gas mixtures. The mechanics of collisions are vectorizable in the DSMC method, but several optimizations are suggested that greatly enhance performance. Also this thesis proposes a new mechanism for the exchange of energy between vibration and other energy modes. The developed scheme makes use of quantized vibrational states and is used in place of the Borgnakke-Larsen model. Finally, a simplified representation of physical space and boundary conditions is utilized to further reduce the computational cost of the developed method. Comparison to solutions obtained from the DSMC method for the relaxation of internal energy modes in a homogeneous gas, as well as single and multiple specie shock wave profiles, are presented. Additionally, a large scale simulation of the flow about the proposed Aeroassisted Flight Experiment (AFE) vehicle is included as an example of the new computational capability of the developed particle simulation method.
Fast search algorithms for computational protein design.
Traoré, Seydou; Roberts, Kyle E; Allouche, David; Donald, Bruce R; André, Isabelle; Schiex, Thomas; Barbe, Sophie
2016-05-01
One of the main challenges in computational protein design (CPD) is the huge size of the protein sequence and conformational space that has to be computationally explored. Recently, we showed that state-of-the-art combinatorial optimization technologies based on Cost Function Network (CFN) processing allow speeding up provable rigid backbone protein design methods by several orders of magnitudes. Building up on this, we improved and injected CFN technology into the well-established CPD package Osprey to allow all Osprey CPD algorithms to benefit from associated speedups. Because Osprey fundamentally relies on the ability of A* to produce conformations in increasing order of energy, we defined new A* strategies combining CFN lower bounds, with new side-chain positioning-based branching scheme. Beyond the speedups obtained in the new A*-CFN combination, this novel branching scheme enables a much faster enumeration of suboptimal sequences, far beyond what is reachable without it. Together with the immediate and important speedups provided by CFN technology, these developments directly benefit to all the algorithms that previously relied on the DEE/ A* combination inside Osprey* and make it possible to solve larger CPD problems with provable algorithms. PMID:26833706
Parallel algorithms for computational geometry utilizing a fixed number of processors
Strader, R.G.
1988-01-01
The design of algorithms for systems where both communication and computation are important is presented. Approaches to parallel computation and the underlying theoretical models are surveyed. two models of computation are developed, both based on a divide-and-conquer strategy. The first utilizes a tree-like merge resulting in several levels of communication and computation, the total number determined by the number of processors. The second model contains a fixed number of levels independent of the number of processors. Using the notation from the survey and the models of computation, algorithms are designed for the computational geometry problems of finding the convex hull and Delaunay triangulation for a set of uniform random points in the Euclidean plane. Communication and computation timing measurements based on these algorithms are presented and analyzed. The results are then generalized to predict the behavior of expanded problems. Architectural support, partitioning issues, and limitations of this approach are discussed.
Pipelined CPU Design with FPGA in Teaching Computer Architecture
ERIC Educational Resources Information Center
Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon
2012-01-01
This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…
In-Memory Computing Architectures for Sparse Distributed Memory.
Kang, Mingu; Shanbhag, Naresh R
2016-08-01
This paper presents an energy-efficient and high-throughput architecture for Sparse Distributed Memory (SDM)-a computational model of the human brain [1]. The proposed SDM architecture is based on the recently proposed in-memory computing kernel for machine learning applications called Compute Memory (CM) [2], [3]. CM achieves energy and throughput efficiencies by deeply embedding computation into the memory array. SDM-specific techniques such as hierarchical binary decision (HBD) are employed to reduce the delay and energy further. The CM-based SDM (CM-SDM) is a mixed-signal circuit, and hence circuit-aware behavioral, energy, and delay models in a 65 nm CMOS process are developed in order to predict system performance of SDM architectures in the auto- and hetero-associative modes. The delay and energy models indicate that CM-SDM, in general, can achieve up to 25 × and 12 × delay and energy reduction, respectively, over conventional SDM. When classifying 16 × 16 binary images with high noise levels (input bad pixel ratios: 15%-25%) into nine classes, all SDM architectures are able to generate output bad pixel ratios (Bo) ≤ 2%. The CM-SDM exhibits negligible loss in accuracy, i.e., its Bo degradation is within 0.4% as compared to that of the conventional SDM. PMID:27305686
Computational algorithms to predict Gene Ontology annotations
2015-01-01
Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper
VLSI architectures for computing multiplications and inverses in GF(2m).
Wang, C C; Truong, T K; Shao, H M; Deutsch, L J; Omura, J K; Reed, I S
1985-08-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore, naturally suitable for VLSI implementation. PMID:11539660
VLSI architectures for computing multiplications and inverses in GF(2-m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.; Reed, I. S.
1983-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
VLSI architectures for computing multiplications and inverses in GF(2m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.
1985-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
Efficient Algorithm and Architecture of Critical-Band Transform for Low-Power Speech Applications
NASA Astrophysics Data System (ADS)
Wang, Chao; Gan, Woon-Seng
2007-12-01
An efficient algorithm and its corresponding VLSI architecture for the critical-band transform (CBT) are developed to approximate the critical-band filtering of the human ear. The CBT consists of a constant-bandwidth transform in the lower frequency range and a Brown constant-[InlineEquation not available: see fulltext.] transform (CQT) in the higher frequency range. The corresponding VLSI architecture is proposed to achieve significant power efficiency by reducing the computational complexity, using pipeline and parallel processing, and applying the supply voltage scaling technique. A 21-band Bark scale CBT processor with a sampling rate of 16 kHz is designed and simulated. Simulation results verify its suitability for performing short-time spectral analysis on speech. It has a better fitting on the human ear critical-band analysis, significantly fewer computations, and therefore is more energy-efficient than other methods. With a 0.35[InlineEquation not available: see fulltext.]m CMOS technology, it calculates a 160-point speech in 4.99 milliseconds at 234 kHz. The power dissipation is 15.6[InlineEquation not available: see fulltext.]W at 1.1 V. It achieves 82.1[InlineEquation not available: see fulltext.] power reduction as compared to a benchmark 256-point FFT processor.
Quantum computations: algorithms and error correction
NASA Astrophysics Data System (ADS)
Kitaev, A. Yu
1997-12-01
Contents §0. Introduction §1. Abelian problem on the stabilizer §2. Classical models of computations2.1. Boolean schemes and sequences of operations2.2. Reversible computations §3. Quantum formalism3.1. Basic notions and notation3.2. Transformations of mixed states3.3. Accuracy §4. Quantum models of computations4.1. Definitions and basic properties4.2. Construction of various operators from the elements of a basis4.3. Generalized quantum control and universal schemes §5. Measurement operators §6. Polynomial quantum algorithm for the stabilizer problem §7. Computations with perturbations: the choice of a model §8. Quantum codes (definitions and general properties)8.1. Basic notions and ideas8.2. One-to-one codes8.3. Many-to-one codes §9. Symplectic (additive) codes9.1. Algebraic preparation9.2. The basic construction9.3. Error correction procedure9.4. Torus codes §10. Error correction in the computation process: general principles10.1. Definitions and results10.2. Proofs §11. Error correction: concrete procedures11.1. The symplecto-classical case11.2. The case of a complete basis Bibliography
Block sparse Cholesky algorithms on advanced uniprocessor computers
Ng, E.G.; Peyton, B.W.
1991-12-01
As with many other linear algebra algorithms, devising a portable implementation of sparse Cholesky factorization that performs well on the broad range of computer architectures currently available is a formidable challenge. Even after limiting our attention to machines with only one processor, as we have done in this report, there are still several interesting issues to consider. For dense matrices, it is well known that block factorization algorithms are the best means of achieving this goal. We take this approach for sparse factorization as well. This paper has two primary goals. First, we examine two sparse Cholesky factorization algorithms, the multifrontal method and a blocked left-looking sparse Cholesky method, in a systematic and consistent fashion, both to illustrate the strengths of the blocking techniques in general and to obtain a fair evaluation of the two approaches. Second, we assess the impact of various implementation techniques on time and storage efficiency, paying particularly close attention to the work-storage requirement of the two methods and their variants.
Algorithmic cooling and scalable NMR quantum computers
Boykin, P. Oscar; Mor, Tal; Roychowdhury, Vwani; Vatan, Farrokh; Vrijen, Rutger
2002-01-01
We present here algorithmic cooling (via polarization heat bath)—a powerful method for obtaining a large number of highly polarized spins in liquid nuclear-spin systems at finite temperature. Given that spin-half states represent (quantum) bits, algorithmic cooling cleans dirty bits beyond the Shannon's bound on data compression, by using a set of rapidly thermal-relaxing bits. Such auxiliary bits could be implemented by using spins that rapidly get into thermal equilibrium with the environment, e.g., electron spins. Interestingly, the interaction with the environment, usually a most undesired interaction, is used here to our benefit, allowing a cooling mechanism. Cooling spins to a very low temperature without cooling the environment could lead to a breakthrough in NMR experiments, and our “spin-refrigerating” method suggests that this is possible. The scaling of NMR ensemble computers is currently one of the main obstacles to building larger-scale quantum computing devices, and our spin-refrigerating method suggests that this problem can be resolved. PMID:11904402
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Woodruff, S.B.
1992-05-01
The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems.
Portable computer system architecture for the Space Station Freedom program
NASA Technical Reports Server (NTRS)
Alena, Richard; Liu, Yuan-Kwei; Fernquist, Alan R.
1993-01-01
This paper outlines various mission requirements and technical approaches that support the potential use of portable computers in several defined activities within the Space Station Freedom (SSF) program. Specifically, the use of portable computers as consoles for both spacecraft control and payload applications is presented. Various issues and proposed solutions regarding the incorporation of portable computers within the program are presented. The primary issues presented regard architecture (standard interface for expansion, advanced processors and displays), integration (methods of high-speed data communication, peripheral interfaces, and interconnectivity within various support networks), and evolution (wireless communications and multimedia data interface methods).
Reviews of computing technology: A review of compound document architectures
Hudson, B.J.
1991-10-01
This review of computing technology will define, describe, and give examples of various approaches to document management through the use of compound document architectures. Experts agree that only 10% of business information exists in machine readable form, but much of what is stored is not in useful form. As a result, the average business document is copied over a dozen times during its life and duplicate copies are stored in numerous locations. The goal of compound document architectures is to provide an information support environment where rapid access to the correct information in the proper format is simplified. A compound document architecture provides structure to seemingly unstructured electronic documents, and standardizes the methods for interchange and access of entire or partial documents by authors and users.
Architecture Framework for Trapped-Ion Quantum Computer based on Performance Simulation Tool
NASA Astrophysics Data System (ADS)
Ahsan, Muhammad
The challenge of building scalable quantum computer lies in striking appropriate balance between designing a reliable system architecture from large number of faulty computational resources and improving the physical quality of system components. The detailed investigation of performance variation with physics of the components and the system architecture requires adequate performance simulation tool. In this thesis we demonstrate a software tool capable of (1) mapping and scheduling the quantum circuit on a realistic quantum hardware architecture with physical resource constraints, (2) evaluating the performance metrics such as the execution time and the success probability of the algorithm execution, and (3) analyzing the constituents of these metrics and visualizing resource utilization to identify system components which crucially define the overall performance. Using this versatile tool, we explore vast design space for modular quantum computer architecture based on trapped ions. We find that while success probability is uniformly determined by the fidelity of physical quantum operation, the execution time is a function of system resources invested at various layers of design hierarchy. At physical level, the number of lasers performing quantum gates, impact the latency of the fault-tolerant circuit blocks execution. When these blocks are used to construct meaningful arithmetic circuit such as quantum adders, the number of ancilla qubits for complicated non-clifford gates and entanglement resources to establish long-distance communication channels, become major performance limiting factors. Next, in order to factorize large integers, these adders are assembled into modular exponentiation circuit comprising bulk of Shor's algorithm. At this stage, the overall scaling of resource-constraint performance with the size of problem, describes the effectiveness of chosen design. By matching the resource investment with the pace of advancement in hardware technology
A pipelined IC architecture for radon transform computations in a multiprocessor array
Agi, I.; Hurst, P.J.; Current, K.W. . Dept. of Electrical Engineering and Computer Science)
1990-05-25
The amount of data generated by CT scanners is enormous, making the reconstruction operation slow, especially for 3-D and limited-data scans requiring iterative algorithms. The Radon transform and its inverse, commonly used for CT image reconstruction from projections, are computationally burdensome for today's single-processor computer architectures. If the processing times for the forward and inverse Radon transforms were comparatively small, a large set of new CT algorithms would become feasible, especially those for 3-D and iterative tomographic image reconstructions. In addition to image reconstruction, a fast Radon Transform Computer'' could be naturally applied in other areas of multidimensional signal processing including 2-D power spectrum estimation, modeling of human perception, Hough transforms, image representation, synthetic aperture radar processing, and others. A high speed processor for this operation is likely to motivate new algorithms for general multidimensional signal processing using the Radon transform. In the proposed workshop paper, we will first describe interpolation schemes useful in computation of the discrete Radon transform and backprojection and compare their errors and hardware complexities. We then will evaluate through statistical means the fixed-point number system required to accept and generate 12-bit input and output data with acceptable error using the linear interpolation scheme selected. These results set some of the requirements that must be met by our new VLSI chip architecture. Finally we will present a new unified architecture for a single-chip processor for computing both the forward Radon transform and backprojection at high data rates. 3 refs., 2 figs.
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
QNIX: A Linear Optical Architecture for Quantum Computing
NASA Astrophysics Data System (ADS)
Gimeno-Segovia, Mercedes; Shadbolt, Peter J.; Rudolph, Terry G.; Browne, Dan E.; Mendoza, Gabriel; Russell, Nicholas J.; Silverstone, Joshua W.; Santamato, Alberto; Carolan, Jacques; O'Brien, Jeremy
2015-03-01
There is currently a great deal of effort to develop a large-scale quantum computer, and one of the most promising platforms to do so is integrated linear optics. We present a proposal for a dynamical scheme for an integrated linear optics implementation of a one-way quantum computer. We go beyond the purely theoretical work and address practical issues in order to create a physically realistic design. We describe every step of cluster state construction and processing, showing the outstanding issues left to be addressed and our contributions to the different stages of the dynamical process. These include optimised interferometers for the generation of GHZ states, a universal and scalable architecture which requires entangled sources of no more than 3 photons with no active feed-forward, and loss-tolerant and fault-tolerant strategies specifically tailored to our proposed architecture. Our work demonstrates that building a linear optical quantum computer need be less challenging than previously thought, and brings large-scale switch-free linear optical architectures for quantum computing much closer to experimental realisation.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
NASA Astrophysics Data System (ADS)
da Silva, C. R.; da Silveira, P.; Wentzcovitch, R. M.; Pierce, M.; Erlebacher, G.
2007-12-01
We present an overview of the VLab, a system developed to handle execution of extensive workflows generated by first principles computations of thermoelastic properties of minerals. The multiplicity (102-3) of tasks derives from sampling of parameter space with variables such as pressure, temperature, strain, composition, etc. We review the algorithms of physical importance that define the system's requirements, its underlying service oriented architecture (SOA), and metadata. The system architecture emerges naturally. The SOA is a collection of web-services providing access to distributed computing nodes, controlling workflow execution, monitoring services, and providing data analyses tools, visualization services, data bases, and authentication services. A usage view diagram is described. We also show snapshots taken from the actual operational procedure in VLab. Research supported by NSF/ITR (VLab)
VLab: A Service Oriented Architecture for Distributed First Principles Materials Computations
NASA Astrophysics Data System (ADS)
da Silva, Cesar; da Silveira, Pedro; Wentzcovitch, Renata; Pierce, Marlon; Erlebacher, Gordon
2008-03-01
We present an overview of VLab, a system developed to handle execution of extensive workflows generated by first principles computations of thermoelastic properties of minerals. The multiplicity (10^2-3) of tasks derives from sampling of parameter space with variables such as pressure, temperature, strain, composition, etc. We review the algorithms of physical importance that define the system's requirements, its underlying service oriented architecture (SOA), and metadata. The system architecture emerges naturally. The SOA is a collection of web-services providing access to distributed computing nodes, workflow control, and monitoring services, and providing data analysis tools, visualization services, data bases, and authentication services. A usage view diagram is described. We also show snapshots taken from the actual operational procedure in VLab.
Nanotube devices based crossbar architecture: toward neuromorphic computing.
Zhao, W S; Agnus, G; Derycke, V; Filoramo, A; Bourgoin, J-P; Gamrat, C
2010-04-30
Nanoscale devices such as carbon nanotube and nanowires based transistors, memristors and molecular devices are expected to play an important role in the development of new computing architectures. While their size represents a decisive advantage in terms of integration density, it also raises the critical question of how to efficiently address large numbers of densely integrated nanodevices without the need for complex multi-layer interconnection topologies similar to those used in CMOS technology. Two-terminal programmable devices in crossbar geometry seem particularly attractive, but suffer from severe addressing difficulties due to cross-talk, which implies complex programming procedures. Three-terminal devices can be easily addressed individually, but with limited gain in terms of interconnect integration. We show how optically gated carbon nanotube devices enable efficient individual addressing when arranged in a crossbar geometry with shared gate electrodes. This topology is particularly well suited for parallel programming or learning in the context of neuromorphic computing architectures. PMID:20368686
Multi-level Hierarchical Poly Tree computer architectures
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug
1990-01-01
Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
Modified Golomb Algorithm for Computing Unit Fraction Expansions
ERIC Educational Resources Information Center
Man, Yiu-Kwong
2004-01-01
In this note, a modified Golomb algorithm for computing unit fraction expansions is presented. This algorithm has the advantage that the maximal denominators involved in the expansions will not exceed those computed by the original algorithm. In fact, the differences between the maximal denominators or the number of terms obtained by these two…
Thermodynamic cost of computation, algorithmic complexity and the information metric
NASA Technical Reports Server (NTRS)
Zurek, W. H.
1989-01-01
Algorithmic complexity is discussed as a computational counterpart to the second law of thermodynamics. It is shown that algorithmic complexity, which is a measure of randomness, sets limits on the thermodynamic cost of computations and casts a new light on the limitations of Maxwell's demon. Algorithmic complexity can also be used to define distance between binary strings.
Integration of nanoscale memristor synapses in neuromorphic computing architectures
NASA Astrophysics Data System (ADS)
Indiveri, Giacomo; Linares-Barranco, Bernabé; Legenstein, Robert; Deligeorgis, George; Prodromakis, Themistoklis
2013-09-01
Conventional neuro-computing architectures and artificial neural networks have often been developed with no or loose connections to neuroscience. As a consequence, they have largely ignored key features of biological neural processing systems, such as their extremely low-power consumption features or their ability to carry out robust and efficient computation using massively parallel arrays of limited precision, highly variable, and unreliable components. Recent developments in nano-technologies are making available extremely compact and low power, but also variable and unreliable solid-state devices that can potentially extend the offerings of availing CMOS technologies. In particular, memristors are regarded as a promising solution for modeling key features of biological synapses due to their nanoscale dimensions, their capacity to store multiple bits of information per element and the low energy required to write distinct states. In this paper, we first review the neuro- and neuromorphic computing approaches that can best exploit the properties of memristor and scale devices, and then propose a novel hybrid memristor-CMOS neuromorphic circuit which represents a radical departure from conventional neuro-computing approaches, as it uses memristors to directly emulate the biophysics and temporal dynamics of real synapses. We point out the differences between the use of memristors in conventional neuro-computing architectures and the hybrid memristor-CMOS circuit proposed, and argue how this circuit represents an ideal building block for implementing brain-inspired probabilistic computing paradigms that are robust to variability and fault tolerant by design.
Domain decomposition algorithms and computation fluid dynamics
NASA Technical Reports Server (NTRS)
Chan, Tony F.
1988-01-01
In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.
Hardware architecture for full analytical Fraunhofer computer-generated holograms
NASA Astrophysics Data System (ADS)
Pang, Zhi-Yong; Xu, Zong-Xi; Xiong, Yi; Chen, Biao; Dai, Hui-Min; Jiang, Shao-Ji; Dong, Jian-Wen
2015-09-01
Hardware architecture of parallel computation is proposed for generating Fraunhofer computer-generated holograms (CGHs). A pipeline-based integrated circuit architecture is realized by employing the modified Fraunhofer analytical formulism, which is large scale and enables all components to be concurrently operated. The architecture of the CGH contains five modules to calculate initial parameters of amplitude, amplitude compensation, phases, and phase compensation, respectively. The precalculator of amplitude is fully adopted considering the "reusable design" concept. Each complex operation type (such as square arithmetic) is reused only once by means of a multichannel selector. The implemented hardware calculates an 800×600 pixels hologram in parallel using 39,319 logic elements, 21,074 registers, and 12,651 memory bits in an Altera field-programmable gate array environment with stable operation at 50 MHz. Experimental results demonstrate that the quality of the images reconstructed from the hardware-generated hologram can be comparable to that of a software implementation. Moreover, the calculation speed is approximately 100 times faster than that of a personal computer with an Intel i5-3230M 2.6 GHz CPU for a triangular object.
Algorithmic support for commodity-based parallel computing systems.
Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
The Snowcloud System: Architecture and Algorithms for Snow Hydrology Studies
NASA Astrophysics Data System (ADS)
Skalka, C.; Brown, I.; Frolik, J.
2013-12-01
Snowcloud is an embedded data collection system for snow hydrology field research campaigns conducted in harsh climates and remote areas. The system combines distributed wireless sensor network technology and computational techniques to provide data at lower cost and higher spatio-temporal resolution than ground-based systems using traditional methods. Snowcloud has seen multiple Winter deployments in settings ranging from high desert to arctic, resulting in over a dozen node-years of practical experience. The Snowcloud system architecture consists of multiple TinyOS mesh-networked sensor stations collecting environmental data above and, in some deployments, below the snowpack. Monitored data modalities include snow depth, ground and air temperature, PAR and leaf-area index (LAI), and soil moisture. To enable power cycling and control of multiple sensors a custom power and sensor conditioning board was developed. The electronics and structural systems for individual stations have been designed and tested (in the lab and in situ) for ease of assembly and robustness to harsh winter conditions. Battery systems and solar chargers enable seasonal operation even under low/no light arctic conditions. Station costs range between 500 and 1000 depending on the instrumentation suite. For remote field locations, a custom designed hand-held device and data retrieval protocol serves as the primary data collection method. We are also developing and testing a Gateway device that will report data in near-real-time (NRT) over a cellular connection. Data is made available to users via web interfaces that also provide basic data analysis and visualization tools. For applications to snow hydrology studies, the better spatiotemporal resolution of snowpack data provided by Snowcloud is beneficial in several aspects. It provides insight into snowpack evolution, and allows us to investigate differences across different spatial and temporal scales in deployment areas. It enables the
Architectural requirements for the Red Storm computing system.
Camp, William J.; Tomkins, James Lee
2003-10-01
This report is based on the Statement of Work (SOW) describing the various requirements for delivering 3 new supercomputer system to Sandia National Laboratories (Sandia) as part of the Department of Energy's (DOE) Accelerated Strategic Computing Initiative (ASCI) program. This system is named Red Storm and will be a distributed memory, massively parallel processor (MPP) machine built primarily out of commodity parts. The requirements presented here distill extensive architectural and design experience accumulated over a decade and a half of research, development and production operation of similar machines at Sandia. Red Storm will have an unusually high bandwidth, low latency interconnect, specially designed hardware and software reliability features, a light weight kernel compute node operating system and the ability to rapidly switch major sections of the machine between classified and unclassified computing environments. Particular attention has been paid to architectural balance in the design of Red Storm, and it is therefore expected to achieve an atypically high fraction of its peak speed of 41 TeraOPS on real scientific computing applications. In addition, Red Storm is designed to be upgradeable to many times this initial peak capability while still retaining appropriate balance in key design dimensions. Installation of the Red Storm computer system at Sandia's New Mexico site is planned for 2004, and it is expected that the system will be operated for a minimum of five years following installation.
Scaling to Nanotechnology Limits with the PIMS Computer Architecture and a new Scaling Rule.
Debenedictis, Erik
2015-02-01
We describe a new approach to computing that moves towards the limits of nanotechnology using a newly formulated sc aling rule. This is in contrast to the current computer industry scali ng away from von Neumann's original computer at the rate of Moore's Law. We extend Moore's Law to 3D, which l eads generally to architectures that integrate logic and memory. To keep pow er dissipation cons tant through a 2D surface of the 3D structure requires using adiabatic principles. We call our newly proposed architecture Processor In Memory and Storage (PIMS). We propose a new computational model that integrates processing and memory into "tiles" that comprise logic, memory/storage, and communications functions. Since the programming model will be relatively stable as a system scales, programs repr esented by tiles could be executed in a PIMS system built with today's technology or could become the "schematic diagram" for implementation in an ultimate 3D nanotechnology of the future. We build a systems software approach that offers advantages over and above the technological and arch itectural advantages. Firs t, the algorithms may be more efficient in the conventional sens e of having fewer steps. Second, the algorithms may run with higher power efficiency per operation by being a better match for the adiabatic scaling ru le. The performance analysis based on demonstrated ideas in physical science suggests 80,000 x improvement in cost per operation for the (arguably) gene ral purpose function of emulating neurons in Deep Learning.
A Component Architecture for High-Performance Scientific Computing
Bernholdt, D E; Allan, B A; Armstrong, R; Bertrand, F; Chiu, K; Dahlgren, T L; Damevski, K; Elwasif, W R; Epperly, T W; Govindaraju, M; Katz, D S; Kohl, J A; Krishnan, M; Kumfert, G; Larson, J W; Lefantzi, S; Lewis, M J; Malony, A D; McInnes, L C; Nieplocha, J; Norris, B; Parker, S G; Ray, J; Shende, S; Windus, T L; Zhou, S
2004-12-14
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
A Component Architecture for High-Performance Scientific Computing
Bernholdt, David E; Allan, Benjamin A; Armstrong, Robert C; Bertrand, Felipe; Chiu, Kenneth; Dahlgren, Tamara L; Damevski, Kostadin; Elwasif, Wael R; Epperly, Thomas G; Govindaraju, Madhusudhan; Katz, Daniel S; Kohl, James A; Krishnan, Manoj Kumar; Kumfert, Gary K; Larson, J Walter; Lefantzi, Sophia; Lewis, Michael J; Malony, Allen D; McInnes, Lois C; Nieplocha, Jarek; Norris, Boyana; Parker, Steven G; Ray, Jaideep; Shende, Sameer; Windus, Theresa L; Zhou, Shujia
2006-07-03
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
HiMAT onboard flight computer system architecture and qualification
NASA Technical Reports Server (NTRS)
Myers, A. F.; Earls, M. R.; Callizo, L. A.
1981-01-01
Two highly maneuverable aircraft technology (HiMAT) remotely piloted research vehicles (RPRV's) are being flight tested at NASA Dryden Flight Research Center, Edwards, California, to demonstrate and evaluate a number of technological advances applicable to future fighter aircraft. Closed-loop primary flight control is performed from a ground-based cockpit utilizing a digital computer and up/down telemetry links. A backup flight control system for emergency operation resides in one of two onboard computers. Other functions of the onboard computer system are uplink processing, downlink processing, engine control, failure detection, and redundancy management. This paper describes the architecture, functions, and flight qualification of the HiMAT onboard flight computer systems.
Reveal, A General Reverse Engineering Algorithm for Inference of Genetic Network Architectures
NASA Technical Reports Server (NTRS)
Liang, Shoudan; Fuhrman, Stefanie; Somogyi, Roland
1998-01-01
Given the immanent gene expression mapping covering whole genomes during development, health and disease, we seek computational methods to maximize functional inference from such large data sets. Is it possible, in principle, to completely infer a complex regulatory network architecture from input/output patterns of its variables? We investigated this possibility using binary models of genetic networks. Trajectories, or state transition tables of Boolean nets, resemble time series of gene expression. By systematically analyzing the mutual information between input states and output states, one is able to infer the sets of input elements controlling each element or gene in the network. This process is unequivocal and exact for complete state transition tables. We implemented this REVerse Engineering ALgorithm (REVEAL) in a C program, and found the problem to be tractable within the conditions tested so far. For n = 50 (elements) and k = 3 (inputs per element), the analysis of incomplete state transition tables (100 state transition pairs out of a possible 10(exp 15)) reliably produced the original rule and wiring sets. While this study is limited to synchronous Boolean networks, the algorithm is generalizable to include multi-state models, essentially allowing direct application to realistic biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields.
Distributed sequence alignment applications for the public computing architecture.
Pellicer, S; Chen, G; Chan, K C C; Pan, Y
2008-03-01
The public computer architecture shows promise as a platform for solving fundamental problems in bioinformatics such as global gene sequence alignment and data mining with tools such as the basic local alignment search tool (BLAST). Our implementation of these two problems on the Berkeley open infrastructure for network computing (BOINC) platform demonstrates a runtime reduction factor of 1.15 for sequence alignment and 16.76 for BLAST. While the runtime reduction factor of the global gene sequence alignment application is modest, this value is based on a theoretical sequential runtime extrapolated from the calculation of a smaller problem. Because this runtime is extrapolated from running the calculation in memory, the theoretical sequential runtime would require 37.3 GB of memory on a single system. With this in mind, the BOINC implementation not only offers the reduced runtime, but also the aggregation of the available memory of all participant nodes. If an actual sequential run of the problem were compared, a more drastic reduction in the runtime would be seen due to an additional secondary storage I/O overhead for a practical system. Despite the limitations of the public computer architecture, most notably in communication overhead, it represents a practical platform for grid- and cluster-scale bioinformatics computations today and shows great potential for future implementations. PMID:18334454
The computational structural mechanics testbed architecture. Volume 2: The interface
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the third set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 3 describes the CLIP-Processor interface and related topics. It is intended only for processor developers.
Integrated command, control, communications and computation system functional architecture
NASA Technical Reports Server (NTRS)
Cooley, C. G.; Gilbert, L. E.
1981-01-01
The functional architecture for an integrated command, control, communications, and computation system applicable to the command and control portion of the NASA End-to-End Data. System is described including the downlink data processing and analysis functions required to support the uplink processes. The functional architecture is composed of four elements: (1) the functional hierarchy which provides the decomposition and allocation of the command and control functions to the system elements; (2) the key system features which summarize the major system capabilities; (3) the operational activity threads which illustrate the interrelationahip between the system elements; and (4) the interfaces which illustrate those elements that originate or generate data and those elements that use the data. The interfaces also provide a description of the data and the data utilization and access techniques.
The computational structural mechanics testbed architecture. Volume 2: Directives
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the second of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 2 describes the CLIP directives in detail. It is intended for intermediate and advanced users.
The computational structural mechanics testbed architecture. Volume 1: The language
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the first set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP, and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 1 presents the basic elements of the CLAMP language and is intended for all users.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
NASA Technical Reports Server (NTRS)
Rutishauser, David K.
2006-01-01
The motivation for this work comes from an observation that amidst the push for Massively Parallel (MP) solutions to high-end computing problems such as numerical physical simulations, large amounts of legacy code exist that are highly optimized for vector supercomputers. Because re-hosting legacy code often requires a complete re-write of the original code, which can be a very long and expensive effort, this work examines the potential to exploit reconfigurable computing machines in place of a vector supercomputer to implement an essentially unmodified legacy source code. Custom and reconfigurable computing resources could be used to emulate an original application's target platform to the extent required to achieve high performance. To arrive at an architecture that delivers the desired performance subject to limited resources involves solving a multi-variable optimization problem with constraints. Prior research in the area of reconfigurable computing has demonstrated that designing an optimum hardware implementation of a given application under hardware resource constraints is an NP-complete problem. The premise of the approach is that the general issue of applying reconfigurable computing resources to the implementation of an application, maximizing the performance of the computation subject to physical resource constraints, can be made a tractable problem by assuming a computational paradigm, such as vector processing. This research contributes a formulation of the problem and a methodology to design a reconfigurable vector processing implementation of a given application that satisfies a performance metric. A generic, parametric, architectural framework for vector processing implemented in reconfigurable logic is developed as a target for a scheduling/mapping algorithm that maps an input computation to a given instance of the architecture. This algorithm is integrated with an optimization framework to arrive at a specification of the architecture parameters
Architecture for High Speed Learning of Neural Network using Genetic Algorithm
NASA Astrophysics Data System (ADS)
Yoshikawa, Masaya; Terai, Hidekazu
This paper discusses the architecture for high speed learning of Neural Network (NN) using Genetic Algorithm (GA). The proposed architecture prevents local minimum by using the GA characteristic of holding several individual populations for a population-based search and achieves high speed processing adopting dedicated hardware. To keep general purpose equal software processing, the proposed architecture can be flexible genetic operations on GA and is introduced both Sigmoid function and Heaviside function on NN. Furthermore, the proposed architecture is not optimized only the pipeline at evaluation phase on NN, but also optimized hierarchic pipelines on the whole at evolutionary phase. We have done the simulation, verification and logic synthesis using library of 0.35μm CMOS standard cell. Simulation results evaluating the proposed architecture show to achieve 22 times speed on average compared with software processing.
Evaluation of leading scalar and vector architectures for scientific computations
Simon, Horst D.; Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Ethier, Stephane; Shalf, John
2004-04-20
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This project examines the performance of the cacheless vector Earth Simulator (ES) and compares it to superscalar cache-based IBM Power3 system. Results demonstrate that the ES is significantly faster than the Power3 architecture, highlighting the tremendous potential advantage of the ES for numerical simulation. However, vectorization of a particle-in-cell application (GTC) greatly increased the memory footprint preventing loop-level parallelism and limiting scalability potential.
The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures
Spafford, Kyle L; Meredith, Jeremy S; Lee, Seyong; Li, Dong; Roth, Philip C; Vetter, Jeffrey S
2012-01-01
With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Sorting on STAR. [CDC computer algorithm timing comparison
NASA Technical Reports Server (NTRS)
Stone, H. S.
1978-01-01
Timing comparisons are given for three sorting algorithms written for the CDC STAR computer. One algorithm is Hoare's (1962) Quicksort, which is the fastest or nearly the fastest sorting algorithm for most computers. A second algorithm is a vector version of Quicksort that takes advantage of the STAR's vector operations. The third algorithm is an adaptation of Batcher's (1968) sorting algorithm, which makes especially good use of vector operations but has a complexity of N(log N)-squared as compared with a complexity of N log N for the Quicksort algorithms. In spite of its worse complexity, Batcher's sorting algorithm is competitive with the serial version of Quicksort for vectors up to the largest that can be treated by STAR. Vector Quicksort outperforms the other two algorithms and is generally preferred. These results indicate that unusual instruction sets can introduce biases in program execution time that counter results predicted by worst-case asymptotic complexity analysis.
Biomorphic Multi-Agent Architecture for Persistent Computing
NASA Technical Reports Server (NTRS)
Lodding, Kenneth N.; Brewster, Paul
2009-01-01
A multi-agent software/hardware architecture, inspired by the multicellular nature of living organisms, has been proposed as the basis of design of a robust, reliable, persistent computing system. Just as a multicellular organism can adapt to changing environmental conditions and can survive despite the failure of individual cells, a multi-agent computing system, as envisioned, could adapt to changing hardware, software, and environmental conditions. In particular, the computing system could continue to function (perhaps at a reduced but still reasonable level of performance) if one or more component( s) of the system were to fail. One of the defining characteristics of a multicellular organism is unity of purpose. In biology, the purpose is survival of the organism. The purpose of the proposed multi-agent architecture is to provide a persistent computing environment in harsh conditions in which repair is difficult or impossible. A multi-agent, organism-like computing system would be a single entity built from agents or cells. Each agent or cell would be a discrete hardware processing unit that would include a data processor with local memory, an internal clock, and a suite of communication equipment capable of both local line-of-sight communications and global broadcast communications. Some cells, denoted specialist cells, could contain such additional hardware as sensors and emitters. Each cell would be independent in the sense that there would be no global clock, no global (shared) memory, no pre-assigned cell identifiers, no pre-defined network topology, and no centralized brain or control structure. Like each cell in a living organism, each agent or cell of the computing system would contain a full description of the system encoded as genes, but in this case, the genes would be components of a software genome.
A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures
NASA Astrophysics Data System (ADS)
Lopes, Raul H. C.; Reid, Ivan D.; Hobson, Peter R.
2014-06-01
Variations of k-d trees represent a fundamental data structure used in Computational Geometry with numerous applications in science. For example particle track fitting in the software of the LHC experiments, and in simulations of N-body systems in the study of dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hutt in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the computation time upper bounds to O(n log n) and even O(n) from O(n2). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations for the algorithm that can take advantage of multi-core architectures.
A Component Architecture for High-Performance Computing
Bernholdt, D E; Elwasif, W R; Kohl, J A; Epperly, T G W
2003-01-21
The Common Component Architecture (CCA) provides a means for developers to manage the complexity of large-scale scientific software systems and to move toward a ''plug and play'' environment for high-performance computing. The CCA model allows for a direct connection between components within the same process to maintain performance on inter-component calls. It is neutral with respect to parallelism, allowing components to use whatever means they desire to communicate within their parallel ''cohort.'' We will discuss in detail the importance of performance in the design of the CCA and will analyze the performance costs associated with features of the CCA.
Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; Van der Wijngaart, Rob
2003-05-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly out performs the cache-based architectures. However, certain applications are not easily amenable to vectorization and would re quire extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator
ERIC Educational Resources Information Center
Patti, D.; Spadaccini, A.; Palesi, M.; Fazzino, F.; Catania, V.
2012-01-01
The topics of computer architecture are always taught using an Assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students in learning the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This…