Algorithms versus architectures for computational chemistry
NASA Technical Reports Server (NTRS)
Partridge, H.; Bauschlicher, C. W., Jr.
1986-01-01
The algorithms employed are computationally intensive and, as a result, increased performance (both algorithmic and architectural) is required to improve accuracy and to treat larger molecular systems. Several benchmark quantum chemistry codes are examined on a variety of architectures. While these codes are only a small portion of a typical quantum chemistry library, they illustrate many of the computationally intensive kernels and data manipulation requirements of some applications. Furthermore, understanding the performance of the existing algorithm on present and proposed supercomputers serves as a guide for future programs and algorithm development. The algorithms investigated are: (1) a sparse symmetric matrix vector product; (2) a four index integral transformation; and (3) the calculation of diatomic two electron Slater integrals. The vectorization strategies are examined for these algorithms for both the Cyber 205 and Cray XMP. In addition, multiprocessor implementations of the algorithms are looked at on the Cray XMP and on the MIT static data flow machine proposed by DENNIS.
Parallel algorithms and architecture for computation of manipulator forward dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallel computation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Communication-efficient parallel architectures and algorithms for image computations
Alnuweiri, H.M.
1989-01-01
The main purpose of this dissertation is the design of efficient parallel techniques for image computations which require global operations on image pixels, as well as the development of parallel architectures with special communication features which can support global data movement efficiently. The class of image problems considered in this dissertation involves global operations on image pixels, and irregular (data-dependent) data movement operations. Such problems include histogramming, component labeling, proximity computations, computing the Hough Transform, computing convexity of regions and related properties such as computing the diameter and a smallest area enclosing rectangle for each region. Images with multiple figures and multiple labeled-sets of pixels are also considered. Efficient solutions to such problems involve integer sorting, graph theoretic techniques, and techniques from computational geometry. Although such solutions are not computationally intensive (they all require O(n{sup 2}) operations to be performed on an n {times} n image), they require global communications. The emphasis here is on developing parallel techniques for data movement, reduction, and distribution, which lead to processor-time optimal solutions for such problems on the proposed organizations. The proposed parallel architectures are based on a memory array which can be viewed as an arrangement of memory modules in a k-dimensional space such that the modules are connected to buses placed parallel to the orthogonal axes of the space, and each bus is connected to one processor or a group of processors. It will be shown that such organizations are communication-efficient and are thus highly suited to the image problems considered here, and also to several other classes of problems. The proposed organizations have p processors and O(n{sup 2}) words of memory to process n {times} n images.
Algorithmic and architectural optimizations for computationally efficient particle filtering.
Sankaranarayanan, Aswin C; Srivastava, Ankur; Chellappa, Rama
2008-05-01
In this paper, we analyze the computational challenges in implementing particle filtering, especially to video sequences. Particle filtering is a technique used for filtering nonlinear dynamical systems driven by non-Gaussian noise processes. It has found widespread applications in detection, navigation, and tracking problems. Although, in general, particle filtering methods yield improved results, it is difficult to achieve real time performance. In this paper, we analyze the computational drawbacks of traditional particle filtering algorithms, and present a method for implementing the particle filter using the Independent Metropolis Hastings sampler, that is highly amenable to pipelined implementations and parallelization. We analyze the implementations of the proposed algorithm, and, in particular, concentrate on implementations that have minimum processing times. It is shown that the design parameters for the fastest implementation can be chosen by solving a set of convex programs. The proposed computational methodology was verified using a cluster of PCs for the application of visual tracking. We demonstrate a linear speed-up of the algorithm using the methodology proposed in the paper. PMID:18390378
A fast algorithm for parallel computation of multibody dynamics on MIMD parallel architectures
NASA Technical Reports Server (NTRS)
Fijany, Amir; Kwan, Gregory; Bagherzadeh, Nader
1993-01-01
In this paper the implementation of a parallel O(LogN) algorithm for computation of rigid multibody dynamics on a Hypercube MIMD parallel architecture is presented. To our knowledge, this is the first algorithm that achieves the time lower bound of O(LogN) by using an optimal number of O(N) processors. However, in addition to its theoretical significance, the algorithm is also highly efficient for practical implementation on commercially available MIMD parallel architectures due to its highly coarse grain size and simple communication and synchronization requirements. We present a multilevel parallel computation strategy for implementation of the algorithm on a Hypercube. This strategy allows the exploitation of parallelism at several computational levels as well as maximum overlapping of computation and communication to increase the performance of parallel computation.
Parallel algorithms and architectures
Albrecht, A.; Jung, H.; Mehlhorn, K.
1987-01-01
Contents of this book are the following: Preparata: Deterministic simulation of idealized parallel computers on more realistic ones; Convex hull of randomly chosen points from a polytope; Dataflow computing; Parallel in sequence; Towards the architecture of an elementary cortical processor; Parallel algorithms and static analysis of parallel programs; Parallel processing of combinatorial search; Communications; An O(nlogn) cost parallel algorithms for the single function coarsest partition problem; Systolic algorithms for computing the visibility polygon and triangulation of a polygonal region; and RELACS - A recursive layout computing system. Parallel linear conflict-free subtree access.
Sanz, J.L.; Dinstein, I.
1987-01-01
In this paper, some image transforms and features such as projections along linear patterns, convex hull approximations, Hough transform for line detection, diameter, moments, and principal components will be considered. Specifically, we present algorithms for computing these features which are suitable for implementation in image analysis pipeline architectures. In particular, random access memories and other dedicated hardware components which may be found in the implementation of classical techniques are not longer needed in our algorithms. The effectiveness of our approach is demonstrated by running some of the new algorithms in conventional short-pipelines for image analysis.
Algorithmically specialized parallel computers
Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.
1985-01-01
This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.
Design And Implementation Of A Multi-Sensor Fusion Algorithm On A Hypercube Computer Architecture
NASA Astrophysics Data System (ADS)
Glover, Charles W.
1990-03-01
A multi-sensor integration (MSI) algorithm written for sequential single processor computer architecture has been transformed into a concurrent algorithm and implemented in parallel on a multi-processor hypercube computer architecture. This paper will present the philosophy and methodologies used in the decomposition of the sequential MSI algorithm, and its transformation into a parallel MSI algorithm. The parallel MSI algorithm was implemented on a NCUBETM hypercube computer. The performance of the parallel MSI algorithm has been measured and compared against its sequential counterpart by running test case scenarios through a simulation program. The simulation program allows the user to define the trajectories of all players in the scenario, and to pick the sensor suites of the players and their operating characteristics. For example, an air-to-air engagement scenario was used as one of the test cases. In this scenario, two friend aircrafts were being attacked by six foe aircraft in a pincer maneuver. Both the friend and foe aircrafts launch missiles at several different time points in the engagement. The sensor suites on each aircraft are dual mode RADAR, dual mode IRST, and ESM sensors. The modes of the sensors are switched as needed throughout the scenario. The RADAR sensor is used only intermittently, thus most of the MSI information is obtained from passive sensing. The maneuvers in this scenario caused aircraft and missile to constantly fly in and out of sensors field-of-view (F0V). This resulted in the MSI algorithm to constantly reacquire, initiate, and delete new tracks as it tracked all objects in the scenario. The objective was to determine performance of the parallel MSI algorithm in such a complex environment, and to determine how many multi-processors (nodes) of the hypercube could be effectively used by an aircraft in such an environment. For the scenario just discussed, a 4-node hypercube was found to be the optimal size and a factor two in speedup
Introduction to the special section on computer architectures and parallel algorithms for PAMI
Dyer, C.R.
1989-03-01
The topic of multiprocessor computer architectures and parallel algorithms for computer vision and related applications is not new, but researchers are now addressing both a wider scope of issues and emphasizing system integration. Recently, a wide variety of different systems have been designed, built, and tested on a range of image understanding tasks. An important goal beginning to be addressed is how to achieve high performance when a complete, integrated set of component vision processes are combined. The papers in this special section describe a number of approaches to improving the performance of vision architectures. Each paper uses a different model of parallel processing. The first four papers describe machines or chips which have been built, each exhibiting certain advantages for vision. One important distinction between these approaches is in terms of the number of processors used, defining the granularity of parallel processing. The first three papers also evaluate the performance of their systems on a suite of vision tasks covering several image representations and processing requirements.
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Youngblood, John N.; Saha, Aindam
1987-01-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Carroll, C.C.; Youngblood, J.N.; Saha, A.
1987-12-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Tiberghien, J.
1984-01-01
This book presents papers on supercomputers. Topics considered include decentralized computer architecture, new programming languages, data flow computers, reduction computers, parallel prefix calculations, structural and behavioral descriptions of digital systems, instruction sets, software generation, personal computing, and computer architecture education.
Parallel Architecture For Robotics Computation
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1990-01-01
Universal Real-Time Robotic Controller and Simulator (URRCS) is highly parallel computing architecture for control and simulation of robot motion. Result of extensive algorithmic study of different kinematic and dynamic computational problems arising in control and simulation of robot motion. Study led to development of class of efficient parallel algorithms for these problems. Represents algorithmically specialized architecture, in sense capable of exploiting common properties of this class of parallel algorithms. System with both MIMD and SIMD capabilities. Regarded as processor attached to bus of external host processor, as part of bus memory.
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-06-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner for scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.
Feng, Lu; Fedrigo, Enrico; Béchet, Clémentine; Brunner, Elisabeth; Pirani, Werther
2012-06-01
The European Southern Observatory (ESO) is studying the next generation giant telescope, called the European Extremely Large Telescope (E-ELT). With a 42 m diameter primary mirror, it is a significant step from currently existing telescopes. Therefore, the E-ELT with its instruments poses new challenges in terms of cost and computational complexity for the control system, including its adaptive optics (AO). Since the conventional matrix-vector multiplication (MVM) method successfully used so far for AO wavefront reconstruction cannot be efficiently scaled to the size of the AO systems on the E-ELT, faster algorithms are needed. Among those recently developed wavefront reconstruction algorithms, three are studied in this paper from the point of view of design, implementation, and absolute speed on three multicore multi-CPU platforms. We focus on a single-conjugate AO system for the E-ELT. The algorithms are the MVM, the Fourier transform reconstructor (FTR), and the fractal iterative method (FRiM). This study enhances the scaling of these algorithms with an increasing number of CPUs involved in the computation. We discuss implementation strategies, depending on various CPU architecture constraints, and we present the first quantitative execution times so far at the E-ELT scale. MVM suffers from a large computational burden, making the current computing platform undersized to reach timings short enough for AO wavefront reconstruction. In our study, the FTR provides currently the fastest reconstruction. FRiM is a recently developed algorithm, and several strategies are investigated and presented here in order to implement it for real-time AO wavefront reconstruction, and to optimize its execution time. The difficulty to parallelize the algorithm in such architecture is enhanced. We also show that FRiM can provide interesting scalability using a sparse matrix approach. PMID:22695596
Introduction to systolic algorithms and architectures
Bentley, J.L.; Kung, H.T.
1983-01-01
The authors survey the class of systolic special-purpose computer architectures and algorithms, which are particularly well-suited for implementation in very large scale integrated circuitry (VLSI). They give a brief introduction to systolic arrays for a reader with a broad technical background and some experience in using a computer, but who is not necessarily a computer scientist. In addition they briefly survey the technological advances in VLSI that led to the development of systolic algorithms and architectures. 38 references.
Parallel processing algorithms for hydrocodes on a computer with MIMD architecture (DENELCOR's HEP)
Hicks, D.L.
1983-11-01
In real time simulation/prediction of complex systems such as water-cooled nuclear reactors, if reactor operators had fast simulator/predictors to check the consequences of their operations before implementing them, events such as the incident at Three Mile Island might be avoided. However, existing simulator/predictors such as RELAP run slower than real time on serial computers. It appears that the only way to overcome the barrier to higher computing rates is to use computers with architectures that allow concurrent computations or parallel processing. The computer architecture with the greatest degree of parallelism is labeled Multiple Instruction Stream, Multiple Data Stream (MIMD). An example of a machine of this type is the HEP computer by DENELCOR. It appears that hydrocodes are very well suited for parallelization on the HEP. It is a straightforward exercise to parallelize explicit, one-dimensional Lagrangean hydrocodes in a zone-by-zone parallelization. Similarly, implicit schemes can be parallelized in a zone-by-zone fashion via an a priori, symbolic inversion of the tridiagonal matrix that arises in an implicit scheme. These techniques are extended to Eulerian hydrocodes by using Harlow's rezone technique. The extension from single-phase Eulerian to two-phase Eulerian is straightforward. This step-by-step extension leads to hydrocodes with zone-by-zone parallelization that are capable of two-phase flow simulation. Extensions to two and three spatial dimensions can be achieved by operator splitting. It appears that a zone-by-zone parallelization is the best way to utilize the capabilities of an MIMD machine. 40 references.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs mit.edu)
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
NASA Astrophysics Data System (ADS)
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
2011-07-01
We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses summed absolute difference error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation, we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized nonfull grid search CPU-based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and simplified unsymmetrical multi-hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720 × 480 pixels in resolution commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Architecture Adaptive Computing Environment
NASA Technical Reports Server (NTRS)
Dorband, John E.
2006-01-01
Architecture Adaptive Computing Environment (aCe) is a software system that includes a language, compiler, and run-time library for parallel computing. aCe was developed to enable programmers to write programs, more easily than was previously possible, for a variety of parallel computing architectures. Heretofore, it has been perceived to be difficult to write parallel programs for parallel computers and more difficult to port the programs to different parallel computing architectures. In contrast, aCe is supportable on all high-performance computing architectures. Currently, it is supported on LINUX clusters. aCe uses parallel programming constructs that facilitate writing of parallel programs. Such constructs were used in single-instruction/multiple-data (SIMD) programming languages of the 1980s, including Parallel Pascal, Parallel Forth, C*, *LISP, and MasPar MPL. In aCe, these constructs are extended and implemented for both SIMD and multiple- instruction/multiple-data (MIMD) architectures. Two new constructs incorporated in aCe are those of (1) scalar and virtual variables and (2) pre-computed paths. The scalar-and-virtual-variables construct increases flexibility in optimizing memory utilization in various architectures. The pre-computed-paths construct enables the compiler to pre-compute part of a communication operation once, rather than computing it every time the communication operation is performed.
Kennington, J.L.; Helgason, R.V.
1989-06-29
One of the most important computer architecture innovations to appear in the marketplace during the last ten years is parallel processing on a shared memory multicomputer. This report presents empirical results on a Sequent Symmetry S81 on four optimization models used in the area of personnel assignment. Both the detailed algorithms and computational results are presented. Contents: Minimal Spanning Trees - An Empirical Investigation of Parallel Algorithms; Dijkstra's Two-tree Shortest Path Algorithm; An Empirical Analysis of the Dense Assignment Problem; and Solving Generalized Network Problems on a Shared Memory Multiprocessor.
Mapping algorithms on regular parallel architectures
Lee, P.
1989-01-01
It is significant that many of time-intensive scientific algorithms are formulated as nested loops, which are inherently regularly structured. In this dissertation the relations between the mathematical structure of nested loop algorithms and the architectural capabilities required for their parallel execution are studied. The architectural model considered in depth is that of an arbitrary dimensional systolic array. The mathematical structure of the algorithm is characterized by classifying its data-dependence vectors according to the new ZERO-ONE-INFINITE property introduced. Using this classification, the first complete set of necessary and sufficient conditions for correct transformation of a nested loop algorithm onto a given systolic array of an arbitrary dimension by means of linear mappings is derived. Practical methods to derive optimal or suboptimal systolic array implementations are also provided. The techniques developed are used constructively to develop families of implementations satisfying various optimization criteria and to design programmable arrays efficiently executing classes of algorithms. In addition, a Computer-Aided Design system running on SUN workstations has been implemented to help in the design. The methodology, which deals with general algorithms, is illustrated by synthesizing linear and planar systolic array algorithms for matrix multiplication, a reindexed Warshall-Floyd transitive closure algorithm, and the longest common subsequence algorithm.
Kennington, J.L.; Helgason, R.V.
1990-10-01
One of the most important computer architecture innovations to appear in the market place during the last ten years is parallel processing on a shared memory multicomputer. This report presents new algorithms for a variety of network models along with empirical analysis on both sequential and parallel computers. An empirical study on the AT and T KORBX system is also presented. This system uses eight processors each of which has vector capability. Our research program objective is to develop and empirically test new parallel algorithms and software for a wide variety of optimization problems. The problems studied this past year include the shortest path problem, the assignment problem, the semi-assignment problem, the transportation problem, and the generalized network problem. Algorithms for all of these models have been developed and empirically tested on a variety of computers. In addition, we worked with the Military Airlift Command to test the ATT KORBX system located at Scott Air Force Base.
The flight telerobotic servicer: From functional architecture to computer architecture
NASA Technical Reports Server (NTRS)
Lumia, Ronald; Fiala, John
1989-01-01
After a brief tutorial on the NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) functional architecture, the approach to its implementation is shown. First, interfaces must be defined which are capable of supporting the known algorithms. This is illustrated by considering the interfaces required for the SERVO level of the NASREM functional architecture. After interface definition, the specific computer architecture for the implementation must be determined. This choice is obviously technology dependent. An example illustrating one possible mapping of the NASREM functional architecture to a particular set of computers which implements it is shown. The result of choosing the NASREM functional architecture is that it provides a technology independent paradigm which can be mapped into a technology dependent implementation capable of evolving with technology in the laboratory and in space.
Neural algorithms on VLSI concurrent architectures
Caviglia, D.D.; Bisio, G.M.; Parodi, G.
1988-09-01
The research concerns the study of neural algorithms for developing CAD tools with A.I. features in VLSI design activities. In this paper the focus is on optimization problems such as partitioning, placement and routing. These problems require massive computational power to be solved (NP-complete problems) and the standard approach is usually based on euristic techniques. Neural algorithms can be represented by a circuital model. This kind of representation can be easily mapped in a real circuit, which, however, features limited flexibility with respect to the variety of problems. In this sense the simulation of the neural circuit, by mapping it on a digital VLSI concurrent architecture seems to be preferrable; in addition this solution offers a wider choice with regard to algorithms characteristics (e.g. transfer curve of neural elements, reconfigurability of interconnections, etc.). The implementation with programmable components, such as transputers, allows an indirect mapping of the algorithm (one transputer for N neurons) accordingly to the dimension and the characteristics of the problem. In this way the neural algorithm described by the circuit is reduced to the algorithm that simulates the network behavior. The convergence properties of that formulation are studied with respect to the characteristics of the neural element transfer curve.
Stereoscopic depth perception for robot vision: algorithms and architectures
Safranek, R.J.; Kak, A.C.
1983-01-01
The implementation of depth perception algorithms for computer vision is considered. In automated manufacturing, depth information is vital for tasks such as path planning and 3-d scene analysis. The presentation begins with a survey of computer algorithms for stereoscopic depth perception. The emphasis is on the Marr-Poggio paradigm of human stereo vision and its computer implementation. In addition, a stereo matching algorithm based on the relaxation labelling technique is examined. A computer architecture designed to efficiently implement stereo matching algorithms, an MIMD array interfaced to a global memory, is presented. 9 references.
Parallel algorithms and architectures for the manipulator inertia matrix
Amin-Javaheri, M.
1989-01-01
Several parallel algorithms and architectures to compute the manipulator inertia matrix in real time are proposed. An O(N) and an O(log{sub 2}N) parallel algorithm based upon recursive computation of the inertial parameters of sets of composite rigid bodies are formulated. One- and two-dimensional systolic architectures are presented to implement the O(N) parallel algorithm. A cube architecture is employed to implement the diagonal element of the inertia matrix in O(log{sub 2}N) time and the upper off-diagonal elements in O(N) time. The resulting K{sub 1}O(N) + K{sub 2}O(log{sub 2}N) parallel algorithm is more efficient for a cube network implementation. All the architectural configurations are based upon a VLSI Robotics Processor exploiting fine-grain parallelism. In evaluation all the architectural configurations, significant performance parameters such as I/O time and idle time due to processor synchronization as well as CPU utilization and on-chip memory size are fully included. The O(N) and O(log{sub 2}N) parallel algorithms adhere to the precedence relationships among the processors. In order to achieve a higher speedup factor; however, parallel algorithms in conjunction with Non-Strict Computational Models are devised to relax interprocess precedence, and as a result, to decrease the effective computational delays. The effectiveness of the Non-strict Computational Algorithms is verified by computer simulations, based on a PUMA 560 robot manipulator. It is demonstrated that a combination of parallel algorithms and architectures results in a very effective approach to achieve real-time response for computing the manipulator inertia matrix.
Recursive computer architecture for VLSI
Treleaven, P.C.; Hopkins, R.P.
1982-01-01
A general-purpose computer architecture based on the concept of recursion and suitable for VLSI computer systems built from replicated (lego-like) computing elements is presented. The recursive computer architecture is defined by presenting a program organisation, a machine organisation and an experimental machine implementation oriented to VLSI. The experimental implementation is being restricted to simple, identical microcomputers each containing a memory, a processor and a communications capability. This future generation of lego-like computer systems are termed fifth generation computers by the Japanese. 30 references.
Computing architecture for autonomous microgrids
Goldsmith, Steven Y.
2015-09-29
A computing architecture that facilitates autonomously controlling operations of a microgrid is described herein. A microgrid network includes numerous computing devices that execute intelligent agents, each of which is assigned to a particular entity (load, source, storage device, or switch) in the microgrid. The intelligent agents can execute in accordance with predefined protocols to collectively perform computations that facilitate uninterrupted control of the .
Chang, C.Y.
1986-01-01
New results on efficient forms of decoding convolutional codes based on Viterbi and stack algorithms using systolic array architecture are presented. Some theoretical aspects of systolic arrays are also investigated. First, systolic array implementation of Viterbi algorithm is considered, and various properties of convolutional codes are derived. A technique called strongly connected trellis decoding is introduced to increase the efficient utilization of all the systolic array processors. The issues dealing with the composite branch metric generation, survivor updating, overall system architecture, throughput rate, and computations overhead ratio are also investigated. Second, the existing stack algorithm is modified and restated in a more concise version so that it can be efficiently implemented by a special type of systolic array called systolic priority queue. Three general schemes of systolic priority queue based on random access memory, shift register, and ripple register are proposed. Finally, a systematic approach is presented to design systolic arrays for certain general classes of recursively formulated algorithms.
Unifying parametrized VLSI Jacobi algorithms and architectures
NASA Astrophysics Data System (ADS)
Deprettere, Ed F. A.; Moonen, Marc
1993-11-01
Implementing Jacobi algorithms in parallel VLSI processor arrays is a non-trivial task, in particular when the algorithms are parametrized with respect to size and the architectures are parametrized with respect to space-time trade-offs. The paper is concerned with an approach to implement several time-adaptive Jacobi-type algorithms on a parallel processor array, using only Cordic arithmetic and asynchronous communications, such that any degree of parallelism, ranging from single-processor up to full-size array implementation, is supported by a `universal' processing unit. This result is attributed to a gracious interplay between algorithmic and architectural engineering.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than line relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. The parallel implementation of a V-cycle multiple semi-coarsened grid (MSG) algorithm or distributed-memory architectures such as the Intel iPSC/860 and Paragon computers is addressed. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. A mapping of an MSG algorithm to distributed-memory architectures that demonstrate how both levels of parallelism can be exploited is described. The results is a robust and effective multigrid algorithm for distributed-memory machines.
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
Optical linear algebra processors - Architectures and algorithms
NASA Technical Reports Server (NTRS)
Casasent, David
1986-01-01
Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.
NASA Astrophysics Data System (ADS)
Basu, S.; Ganguly, S.; Nemani, R. R.; Mukhopadhyay, S.; Milesi, C.; Votava, P.; Michaelis, A.; Zhang, G.; Cook, B. D.; Saatchi, S. S.; Boyda, E.
2014-12-01
Accurate tree cover delineation is a useful instrument in the derivation of Above Ground Biomass (AGB) density estimates from Very High Resolution (VHR) satellite imagery data. Numerous algorithms have been designed to perform tree cover delineation in high to coarse resolution satellite imagery, but most of them do not scale to terabytes of data, typical in these VHR datasets. In this paper, we present an automated probabilistic framework for the segmentation and classification of 1-m VHR data as obtained from the National Agriculture Imagery Program (NAIP) for deriving tree cover estimates for the whole of Continental United States, using a High Performance Computing Architecture. The results from the classification and segmentation algorithms are then consolidated into a structured prediction framework using a discriminative undirected probabilistic graphical model based on Conditional Random Field (CRF), which helps in capturing the higher order contextual dependencies between neighboring pixels. Once the final probability maps are generated, the framework is updated and re-trained by incorporating expert knowledge through the relabeling of misclassified image patches. This leads to a significant improvement in the true positive rates and reduction in false positive rates. The tree cover maps were generated for the state of California, which covers a total of 11,095 NAIP tiles and spans a total geographical area of 163,696 sq. miles. Our framework produced correct detection rates of around 85% for fragmented forests and 70% for urban tree cover areas, with false positive rates lower than 3% for both regions. Comparative studies with the National Land Cover Data (NLCD) algorithm and the LiDAR high-resolution canopy height model shows the effectiveness of our algorithm in generating accurate high-resolution tree cover maps.
Algorithmically Specialized Parallel Architecture For Robotics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1991-01-01
Computing system called Robot Mathematics Processor (RMP) contains large number of processor elements (PE's) connected in various parallel and serial combinations reconfigurable via software. Special-purpose architecture designed for solving diverse computational problems in robot control, simulation, trajectory generation, workspace analysis, and like. System an MIMD-SIMD parallel architecture capable of exploiting parallelism in different forms and at several computational levels. Major advantage lies in design of cells, which provides flexibility and reconfigurability superior to previous SIMD processors.
Savannah River Site computing architecture
Not Available
1991-03-29
A computing architecture is a framework for making decisions about the implementation of computer technology and the supporting infrastructure. Because of the size, diversity, and amount of resources dedicated to computing at the Savannah River Site (SRS), there must be an overall strategic plan that can be followed by the thousands of site personnel who make decisions daily that directly affect the SRS computing environment and impact the site`s production and business systems. This plan must address the following requirements: There must be SRS-wide standards for procurement or development of computing systems (hardware and software). The site computing organizations must develop systems that end users find easy to use. Systems must be put in place to support the primary function of site information workers. The developers of computer systems must be given tools that automate and speed up the development of information systems and applications based on computer technology. This document describes a proposal for a site-wide computing architecture that addresses the above requirements. In summary, this architecture is standards-based data-driven, and workstation-oriented with larger systems being utilized for the delivery of needed information to users in a client-server relationship.
Savannah River Site computing architecture
Not Available
1991-03-29
A computing architecture is a framework for making decisions about the implementation of computer technology and the supporting infrastructure. Because of the size, diversity, and amount of resources dedicated to computing at the Savannah River Site (SRS), there must be an overall strategic plan that can be followed by the thousands of site personnel who make decisions daily that directly affect the SRS computing environment and impact the site's production and business systems. This plan must address the following requirements: There must be SRS-wide standards for procurement or development of computing systems (hardware and software). The site computing organizations must develop systems that end users find easy to use. Systems must be put in place to support the primary function of site information workers. The developers of computer systems must be given tools that automate and speed up the development of information systems and applications based on computer technology. This document describes a proposal for a site-wide computing architecture that addresses the above requirements. In summary, this architecture is standards-based data-driven, and workstation-oriented with larger systems being utilized for the delivery of needed information to users in a client-server relationship.
Nonlinear hierarchical substructural parallelism and computer architecture
NASA Technical Reports Server (NTRS)
Padovan, Joe
1989-01-01
Computer architecture is investigated in conjunction with the algorithmic structures of nonlinear finite-element analysis. To help set the stage for this goal, the development is undertaken by considering the wide-ranging needs associated with the analysis of rolling tires which possess the full range of kinematic, material and boundary condition induced nonlinearity in addition to gross and local cord-matrix material properties.
Acoustic simulation in architecture with parallel algorithm
NASA Astrophysics Data System (ADS)
Li, Xiaohong; Zhang, Xinrong; Li, Dan
2004-03-01
In allusion to complexity of architecture environment and Real-time simulation of architecture acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in scene is solved with this method. And then the impulse response between sources and receivers at frequency segment, which are calculated with multi-process, are combined into whole frequency response. The numerical experiment shows that parallel arithmetic can improve the acoustic simulating efficiency of complex scene.
VLSI Architectures for Computing DFT's
NASA Technical Reports Server (NTRS)
Truong, T. K.; Chang, J. J.; Hsu, I. S.; Reed, I. S.; Pei, D. Y.
1986-01-01
Simplifications result from use of residue Fermat number systems. System of finite arithmetic over residue Fermat number systems enables calculation of discrete Fourier transform (DFT) of series of complex numbers with reduced number of multiplications. Computer architectures based on approach suitable for design of very-large-scale integrated (VLSI) circuits for computing DFT's. General approach not limited to DFT's; Applicable to decoding of error-correcting codes and other transform calculations. System readily implemented in VLSI.
Efficient tree codes on SIMD computer architectures
NASA Astrophysics Data System (ADS)
Olson, Kevin M.
1996-11-01
This paper describes changes made to a previous implementation of an N -body tree code developed for a fine-grained, SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) addition of quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ~ 1 Gflop on the Maspar MP-2 or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures. This is significant, considering the low relative cost of SIMD architectures.
Neural-network algorithms and architectures for pattern classification
Mao, Weidong.
1991-01-01
The study of the artificial neural networks is an integrated research field that involves the disciplines of applied mathematics, physics, neurobiology, computer science, information, control, parallel processing and VLSI. This dissertation deals with a number of topics from a broad spectrum of neural network research in models, algorithms, applications and VLSI architectures. Specifically, this dissertation is aimed at studying neural network algorithms and architectures for pattern classification tasks. The work presented in this dissertation has a wide range of applications including speech recognition, image recognition, and high level knowledge processing. Supervised neural networks, such as the back-propagation network, can be used for classification tasks as the result of approximating an input/output mapping. They are the approximation-based classifiers. The original gradient descent back propagation learning algorithm exhibits slow convergence speed. Fast algorithms such as the conjugate gradient and quasi-Newton algorithms can be adopted. The main emphasis on neural network classifiers in this dissertation is the competition-based classifiers. Due to the rapid advance in VLSI technology, parallel processing, and computer aided design (CAD), application-specific VLSI systems are becoming more and more powerful and feasible. In particular, VLSI array processors offer high speed and efficiency through their massive parallelism and pipelining, regularity, modularity, and local communication. A unified VLSI array architecture can be used for implementing neural networks and Hidden Markov Models. He also proposes a pipeline interleaving approach to design VLSI array architectures for real-time image and video signal processing.
Fast semivariogram computation using FPGA architectures
NASA Astrophysics Data System (ADS)
Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang
2015-02-01
The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O(n2). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating in the low range of power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. Anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, ostensibly based on refinements to the current modules. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from MRI scans are utilized for the experiments
Scalable computer architecture for digital vascular systems
NASA Astrophysics Data System (ADS)
Goddard, Iain; Chao, Hui; Skalabrin, Mark
1998-06-01
Digital vascular computer systems are used for radiology and fluoroscopy (R/F), angiography, and cardiac applications. In the United States alone, about 26 million procedures of these types are performed annually: about 81% R/F, 11% cardiac, and 8% angiography. Digital vascular systems have a very wide range of performance requirements, especially in terms of data rates. In addition, new features are added over time as they are shown to be clinically efficacious. Application-specific processing modes such as roadmapping, peak opacification, and bolus chasing are particular to some vascular systems. New algorithms continue to be developed and proven, such as Cox and deJager's precise registration methods for masks and live images in digital subtraction angiography. A computer architecture must have high scalability and reconfigurability to meet the needs of this modality. Ideally, the architecture could also serve as the basis for a nonvascular R/F system.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.
1988-01-01
The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
Gropp, William D.
2014-06-23
With the coming end of Moore's law, it has become essential to develop new algorithms and techniques that can provide the performance needed by demanding computational science applications, especially those that are part of the DOE science mission. This work was part of a multi-institution, multi-investigator project that explored several approaches to develop algorithms that would be effective at the extreme scales and with the complex processor architectures that are expected at the end of this decade. The work by this group developed new performance models that have already helped guide the development of highly scalable versions of an algebraic multigrid solver, new programming approaches designed to support numerical algorithms on heterogeneous architectures, and a new, more scalable version of conjugate gradient, an important algorithm in the solution of very large linear systems of equations.
Architectural Implications for Spatial Object Association Algorithms
Kumar, V S; Kurc, T; Saltz, J; Abdulla, G; Kohn, S R; Matarazzo, C
2009-01-29
Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server R, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
Architectural Implications for Spatial Object Association Algorithms*
Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste
2013-01-01
Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
Quadrant architecture for fast in-place algorithms
Besslich, P.W.; Kurowski, J.O.
1983-10-01
The architecture proposed is tailored to support Radix-2/sup k/ based in-place processing of pictorial data. The algorithms make use of signal-flow graphs to describe 2-dimensional in-place operations suitable for image processing. They may be executed on a general-purpose computer but may also be supported by a special parallel architecture. Major advantages of the scheme are in-place processing and parallel access to disjoint sections of memory only. A quadtree-like decomposition of the picture prevents blocking and queuing of private and common buses. 9 references.
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
Algorithmes et architectures pour ordinateurs quantiques supraconducteurs
NASA Astrophysics Data System (ADS)
Blais, A.
2003-09-01
Algorithms and architectures for superconducting quantum computers Since its formulation, information theory was based, implicitly, on the laws of classical physics. Such a formulation is however incomplete because it does not take into account quantum reality. During the last twenty years, expansion of theory information to include quantum effects has known growing interest. The practical realization of a system for quantum data processing system, a quantum computer, presents however many challenges. In this book, we are interested in various aspects of these challenges. We start by presenting algorithmic concepts like optimization of quantum computations and geometric quantum computation. We then consider various designs and aspects of qubits based on Josephson junctions. In particular, an original approach to the interaction between superconducting qubits is presented. This approach is very general since it can be applied to various designs of qubits. Finally, we are interested in read-out of the superconductic flux qubits. The detector suggested here has the advantage that it is possible to uncouple it from the qubit when no measurement is in progress. Depuis sa formulation, la théorie de l'information a été basée, implicitement, sur les lois de la physique classique. Une telle formulation est toutefois incomplète puisqu'elle ne tient pas compte de la réalité quantique. Au cours des vingt dernières années, l'expansion de la théorie de l'information, de façon à englober les effets purement quantiques, a connu un intérêt grandissant. La réalisation d'un système de traitement de l'information quantique, un ordinateur quantique, présente toutefois de nombreux défis. Dans cet ouvrage, on s'intéresse à différents aspects concernant ces défis. On commence par présenter des concepts algorithmiques comme l'optimisation de calculs quantiques et le calcul quantique géométrique. Par la suite, on s'intéresse à différents designs et aspects de l
NASA Technical Reports Server (NTRS)
Metcalfe, A. G.; Bodenheimer, R. E.
1976-01-01
A parallel algorithm for counting the number of logic-l elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
A heterogeneous hierarchical architecture for real-time computing
Skroch, D.A.; Fornaro, R.J.
1988-12-01
The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchial Architecture for Real-Time (H{sup 2} ART) and system software for program loading and interprocessor communication.
A biconjugate gradient type algorithm on massively parallel architectures
NASA Technical Reports Server (NTRS)
Freund, Roland W.; Hochbruck, Marlis
1991-01-01
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
A memory-array architecture for computer vision
Balsara, P.T.
1989-01-01
With the fast advances in the area of computer vision and robotics there is a growing need for machines that can understand images at a very high speed. A conventional von Neumann computer is not suited for this purpose because it takes a tremendous amount of time to solve most typical image processing problems. Exploiting the inherent parallelism present in various vision tasks can significantly reduce the processing time. Fortunately, parallelism is increasingly affordable as hardware gets cheaper. Thus it is now imperative to study computer vision in a parallel processing framework. The author should first design a computational structure which is well suited for a wide range of vision tasks and then develop parallel algorithms which can run efficiently on this structure. Recent advances in VLSI technology have led to several proposals for parallel architectures for computer vision. In this thesis he demonstrates that a memory array architecture with efficient local and global communication capabilities can be used for high speed execution of a wide range of computer vision tasks. This architecture, called the Access Constrained Memory Array Architecture (ACMAA), is efficient for VLSI implementation because of its modular structure, simple interconnect and limited global control. Several parallel vision algorithms have been designed for this architecture. The choice of vision problems demonstrates the versatility of ACMAA for a wide range of vision tasks. These algorithms were simulated on a high level ACMAA simulator running on the Intel iPSC/2 hypercube, a parallel architecture. The results of this simulation are compared with those of sequential algorithms running on a single hypercube node. Details of the ACMAA processor architecture are also presented.
Computational Controls Workstation: Algorithms and hardware
NASA Technical Reports Server (NTRS)
Venugopal, R.; Kumar, M.
1993-01-01
The Computational Controls Workstation provides an integrated environment for the modeling, simulation, and analysis of Space Station dynamics and control. Using highly efficient computational algorithms combined with a fast parallel processing architecture, the workstation makes real-time simulation of flexible body models of the Space Station possible. A consistent, user-friendly interface and state-of-the-art post-processing options are combined with powerful analysis tools and model databases to provide users with a complete environment for Space Station dynamics and control analysis. The software tools available include a solid modeler, graphical data entry tool, O(n) algorithm-based multi-flexible body simulation, and 2D/3D post-processors. This paper describes the architecture of the workstation while a companion paper describes performance and user perspectives.
Brain architecture: a design for natural computation.
Kaiser, Marcus
2007-12-15
Fifty years ago, John von Neumann compared the architecture of the brain with that of the computers he invented and which are still in use today. In those days, the organization of computers was based on concepts of brain organization. Here, we give an update on current results on the global organization of neural systems. For neural systems, we outline how the spatial and topological architecture of neuronal and cortical networks facilitates robustness against failures, fast processing and balanced network activation. Finally, we discuss mechanisms of self-organization for such architectures. After all, the organization of the brain might again inspire computer architecture. PMID:17855223
Computational Biology, Advanced Scientific Computing, and Emerging Computational Architectures
2007-06-27
This CRADA was established at the start of FY02 with $200 K from IBM and matching funds from DOE to support post-doctoral fellows in collaborative research between International Business Machines and Oak Ridge National Laboratory to explore effective use of emerging petascale computational architectures for the solution of computational biology problems. 'No cost' extensions of the CRADA were negotiated with IBM for FY03 and FY04.
Task scheduling in dataflow computer architectures
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
Dataflow computers provide a platform for the solution of a large class of computational problems, which includes digital signal processing and image processing. Many typical applications are represented by a set of tasks which can be repetitively executed in parallel as specified by an associated dataflow graph. Research in this area aims to model these architectures, develop scheduling procedures, and predict the transient and steady state performance. Researchers at NASA have created a model and developed associated software tools which are capable of analyzing a dataflow graph and predicting its runtime performance under various resource and timing constraints. These models and tools were extended and used in this work. Experiments using these tools revealed certain properties of such graphs that require further study. Specifically, the transient behavior at the beginning of the execution of a graph can have a significant effect on the steady state performance. Transformation and retiming of the application algorithm and its initial conditions can produce a different transient behavior and consequently different steady state performance. The effect of such transformations on the resource requirements or under resource constraints requires extensive study. Task scheduling to obtain maximum performance (based on user-defined criteria), or to satisfy a set of resource constraints, can also be significantly affected by a transformation of the application algorithm. Since task scheduling is performed by heuristic algorithms, further research is needed to determine if new scheduling heuristics can be developed that can exploit such transformations. This work has provided the initial development for further long-term research efforts. A simulation tool was completed to provide insight into the transient and steady state execution of a dataflow graph. A set of scheduling algorithms was completed which can operate in conjunction with the modeling and performance tools
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Topological Code Architectures for Quantum Computation
NASA Astrophysics Data System (ADS)
Cesare, Christopher Anthony
This dissertation is concerned with quantum computation using many-body quantum systems encoded in topological codes. The interest in these topological systems has increased in recent years as devices in the lab begin to reach the fidelities required for performing arbitrarily long quantum algorithms. The most well-studied system, Kitaev's toric code, provides both a physical substrate for performing universal fault-tolerant quantum computations and a useful pedagogical tool for explaining the way other topological codes work. In this dissertation, I first review the necessary formalism for quantum information and quantum stabilizer codes, and then I introduce two families of topological codes: Kitaev's toric code and Bombin's color codes. I then present three chapters of original work. First, I explore the distinctness of encoding schemes in the color codes. Second, I introduce a model of quantum computation based on the toric code that uses adiabatic interpolations between static Hamiltonians with gaps constant in the system size. Lastly, I describe novel state distillation protocols that are naturally suited for topological architectures and show that they provide resource savings in terms of the number of required ancilla states when compared to more traditional approaches to quantum gate approximation.
Fibonacci Numbers and Computer Algorithms.
ERIC Educational Resources Information Center
Atkins, John; Geist, Robert
1987-01-01
The Fibonacci Sequence describes a vast array of phenomena from nature. Computer scientists have discovered and used many algorithms which can be classified as applications of Fibonacci's sequence. In this article, several of these applications are considered. (PK)
Computer algorithm for coding gain
NASA Technical Reports Server (NTRS)
Dodd, E. E.
1974-01-01
Development of a computer algorithm for coding gain for use in an automated communications link design system. Using an empirical formula which defines coding gain as used in space communications engineering, an algorithm is constructed on the basis of available performance data for nonsystematic convolutional encoding with soft-decision (eight-level) Viterbi decoding.
A high performance parallel computing architecture for robust image features
NASA Astrophysics Data System (ADS)
Zhou, Renyan; Liu, Leibo; Wei, Shaojun
2014-03-01
A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.
Teaching Computer Aided Architectural Design at UCLA.
ERIC Educational Resources Information Center
Mitchell, William J.
This brief overview includes a rationale for the program and describes course goals and objectives, curriculum content, teaching methods and materials, staffing, and problems of integrating computer aided design with traditional architectural curricula at the School of Architecture and Urban Planning at UCLA. A list of texts for use in teaching…
SCORPIUS algorithm benchmarks on the image understanding architecture machine
NASA Astrophysics Data System (ADS)
Bogdanowicz, Julius F.; Nash, J. Gregory; Shu, David B.
1992-04-01
Many Hughes tactical and strategic programs need high performance image processing. For example, photo-interpretation applications can require up to four orders of magnitude speedup over conventional computer architectures. Therefore, parallel processing systems are needed to help close the processing gap. Vision applications can usually be decomposed into three levels of processing called high, intermediate, and low level vision. Each processing level typically requires different types of numeric/symbolic computation, processing task granularities, and communications bandwidths. No parallel processing system is commercially available that is optimized for the entire range of computations. To meet these processing challenges, the image understanding architecture (IUA) has been developed by Hughes in collaboration with the University of Massachusetts. The IUA is a heterogeneous, hierarchical, associative parallel processor that is organized in three levels corresponding to the vision problem. Its lowest level consists of a large content addressable array parallel processor. This array of 'per pixel' bit serial processors is used for fixed point, low level numeric, and symbolic computations. The middle level is an interface communications array processor (ICAP). ICAP is an array of digital signal processing chips from TI TMS320Cx line, used for high speed number crunching. The highest level is the symbolic processing array. It is an array of general purpose microprocessors in which the artificial intelligence content of the image understanding software resides. A set of benchmarks from the DARPA/ORD sponsored SCORPIUS program were developed using the IUA. The set of algorithms included low level image processing as well as high level matching algorithms. Benchmark performance on the second generation IUA hardware is over four orders of magnitude faster than equivalent algorithms implemented on a DEC VAX 8650. The first generation hardware is operational. Development
Expandable computed-tomography architecture for nondestructive inspection
NASA Astrophysics Data System (ADS)
Agi, Iskender; Hurst, Paul J.; Current, K. W.
1993-04-01
The Radon transform and its inverse, commonly used for computed tomography (CT), are computationally burdensome for single processor computers. Since projection-based computations are easily executed in parallel, multiprocessor architectures have been proposed for high-speed operation. In this paper, we describe an architecture for a high-speed (30 MHz raster-scan image data rate), high accuracy (12-bits per pixel) computed-tomography system for use in non-destructive inspection system. This architecture reconstructs images from fan- or parallel-beam data using either single-pass or iterative reconstruction techniques. Our architecture uses a number of identical processor modules in a pipeline. Each processor module consists of memory for data storage, a commercially available digital signal processing (DSP) chip for filtering, and our custom IC which performs 450 million mathematical operations per second (MOPS). This architecture can reconstruct CT images as large as 1024 X 1024 pixels from a variety of image reconstruction algorithms. The details of the implementation and performance of our expandable architecture are discussed.
Computed laminography and reconstruction algorithm
NASA Astrophysics Data System (ADS)
Que, Jie-Min; Cao, Da-Quan; Zhao, Wei; Tang, Xiao; Sun, Cui-Li; Wang, Yan-Fang; Wei, Cun-Feng; Shi, Rong-Jian; Wei, Long; Yu, Zhong-Qiang; Yan, Yong-Lian
2012-08-01
Computed laminography (CL) is an alternative to computed tomography if large objects are to be inspected with high resolution. This is especially true for planar objects. In this paper, we set up a new scanning geometry for CL, and study the algebraic reconstruction technique (ART) for CL imaging. We compare the results of ART with variant weighted functions by computer simulation with a digital phantom. It proves that ART algorithm is a good choice for the CL system.
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1991-01-01
The Theory of Intelligent Machines proposes a hierarchical organization for the functions of an autonomous robot based on the Principle of Increasing Precision With Decreasing Intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed in recent years. A computer architecture that implements the lower two levels of the intelligent machine is presented. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Details of Execution Level controllers for motion and vision systems are addressed, as well as the Petri net transducer software used to implement Coordination Level functions. Extensions to UNIX and VxWorks operating systems which enable the development of a heterogeneous, distributed application are described. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1992-01-01
The theory of intelligent machines proposes a hierarchical organization for the functions of an autonomous robot based on the principle of increasing precision with decreasing intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed. The authors present a computer architecture that implements the lower two levels of the intelligent machine. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Execution-level controllers for motion and vision systems are briefly addressed, as well as the Petri net transducer software used to implement coordination-level functions. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
A VLSI architecture for simplified arithmetic Fourier transform algorithm
NASA Technical Reports Server (NTRS)
Reed, Irving S.; Shih, Ming-Tang; Truong, T. K.; Hendon, E.; Tufts, D. W.
1992-01-01
The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed in a previous paper for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain recent AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25 percent of that used in the direct method.
Cluster algorithms and computational complexity
NASA Astrophysics Data System (ADS)
Li, Xuenan
Cluster algorithms for the 2D Ising model with a staggered field have been studied and a new cluster algorithm for path sampling has been worked out. The complexity properties of Bak-Seppen model and the Growing network model have been studied by using the Computational Complexity Theory. The dynamic critical behavior of the two-replica cluster algorithm is studied. Several versions of the algorithm are applied to the two-dimensional, square lattice Ising model with a staggered field. The dynamic exponent for the full algorithm is found to be less than 0.5. It is found that odd translations of one replica with respect to the other together with global flips are essential for obtaining a small value of the dynamic exponent. The path sampling problem for the 1D Ising model is studied using both a local algorithm and a novel cluster algorithm. The local algorithm is extremely inefficient at low temperature, where the integrated autocorrelation time is found to be proportional to the fourth power of correlation length. The dynamic exponent of the cluster algorithm is found to be zero and therefore proved to be much more efficient than the local algorithm. The parallel computational complexity of the Bak-Sneppen evolution model is studied. It is shown that Bak-Sneppen histories can be generated by a massively parallel computer in a time that is polylog in the length of the history, which means that the logical depth of producing a Bak-Sneppen history is exponentially less than the length of the history. The parallel dynamics for generating Bak-Sneppen histories is contrasted to standard Bak-Sneppen dynamics. The parallel computational complexity of the Growing Network model is studied. The growth of the network with linear kernels is shown to be not complex and an algorithm with polylog parallel running time is found. The growth of the network with gamma ≥ 2 super-linear kernels can be realized by a randomized parallel algorithm with polylog expected running time.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1987-01-01
The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
THE COMPUTER AND THE ARCHITECTURAL PROFESSION.
ERIC Educational Resources Information Center
HAVILAND, DAVID S.
THE ROLE OF ADVANCING TECHNOLOGY IN THE FIELD OF ARCHITECTURE IS DISCUSSED IN THIS REPORT. PROBLEMS IN COMMUNICATION AND THE DESIGN PROCESS ARE IDENTIFIED. ADVANTAGES AND DISADVANTAGES OF COMPUTERS ARE MENTIONED IN RELATION TO MAN AND MACHINE INTERACTION. PRESENT AND FUTURE IMPLICATIONS OF COMPUTER USAGE ARE IDENTIFIED AND DISCUSSED WITH RESPECT…
Switching from Computer to Microcomputer Architecture Education
ERIC Educational Resources Information Center
Bolanakis, Dimosthenis E.; Kotsis, Konstantinos T.; Laopoulos, Theodore
2010-01-01
In the last decades, the technological and scientific evolution of the computing discipline has been widely affecting research in software engineering education, which nowadays advocates more enlightened and liberal ideas. This article reviews cross-disciplinary research on a computer architecture class in consideration of its switching to…
FFT Computation with Systolic Arrays, A New Architecture
NASA Technical Reports Server (NTRS)
Boriakoff, Valentin
1994-01-01
The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
A fully programmable computing architecture for medical ultrasound machines.
Schneider, Fabio Kurt; Agarwal, Anup; Yoo, Yang Mo; Fukuoka, Tetsuya; Kim, Yongmin
2010-03-01
Application-specific ICs have been traditionally used to support the high computational and data rate requirements in medical ultrasound systems, particularly in receive beamforming. Utilizing the previously developed efficient front-end algorithms, in this paper, we present a simple programmable computing architecture, consisting of a field-programmable gate array (FPGA) and a digital signal processor (DSP), to support core ultrasound signal processing. It was found that 97.3% and 51.8% of the FPGA and DSP resources are, respectively, needed to support all the front-end and back-end processing for B-mode imaging with 64 channels and 120 scanlines per frame at 30 frames/s. These results indicate that this programmable architecture can meet the requirements of low- and medium-level ultrasound machines while providing a flexible platform for supporting the development and deployment of new algorithms and emerging clinical applications. PMID:19546045
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Som, Sukhamoy; Stoughton, John W.; Mielke, Roland R.
1990-01-01
Performance modeling and performance enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures are discussed. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called algorithm to architecture mapping model (ATAMM). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.; Som, Sukhamony
1990-01-01
The performance modeling and enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures is examined. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called ATAMM (Algorithm To Architecture Mapping Model). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Algorithms Bridging Quantum Computation and Chemistry
NASA Astrophysics Data System (ADS)
McClean, Jarrod Ryan
The design of new materials and chemicals derived entirely from computation has long been a goal of computational chemistry, and the governing equation whose solution would permit this dream is known. Unfortunately, the exact solution to this equation has been far too expensive and clever approximations fail in critical situations. Quantum computers offer a novel solution to this problem. In this work, we develop not only new algorithms to use quantum computers to study hard problems in chemistry, but also explore how such algorithms can help us to better understand and improve our traditional approaches. In particular, we first introduce a new method, the variational quantum eigensolver, which is designed to maximally utilize the quantum resources available in a device to solve chemical problems. We apply this method in a real quantum photonic device in the lab to study the dissociation of the helium hydride (HeH+) molecule. We also enhance this methodology with architecture specific optimizations on ion trap computers and show how linear-scaling techniques from traditional quantum chemistry can be used to improve the outlook of similar algorithms on quantum computers. We then show how studying quantum algorithms such as these can be used to understand and enhance the development of classical algorithms. In particular we use a tool from adiabatic quantum computation, Feynman's Clock, to develop a new discrete time variational principle and further establish a connection between real-time quantum dynamics and ground state eigenvalue problems. We use these tools to develop two novel parallel-in-time quantum algorithms that outperform competitive algorithms as well as offer new insights into the connection between the fermion sign problem of ground states and the dynamical sign problem of quantum dynamics. Finally we use insights gained in the study of quantum circuits to explore a general notion of sparsity in many-body quantum systems. In particular we use
The new landscape of parallel computer architecture
NASA Astrophysics Data System (ADS)
Shalf, John
2007-07-01
The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Astrophysics Data System (ADS)
Stoughton, John W.; Mielke, Roland R.
1988-02-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
The Fermilab Central Computing Facility architectural model
Nicholls, J.
1989-05-01
The goal of the current Central Computing Upgrade at Fermilab is to create a computing environment that maximizes total productivity, particularly for high energy physics analysis. The Computing Department and the Next Computer Acquisition Committee decided upon a model which includes five components: an interactive front end, a Large-Scale Scientific Computer (LSSC, a mainframe computing engine), a microprocessor farm system, a file server, and workstations. With the exception of the file server, all segments of this model are currently in production: a VAX/VMS Cluster interactive front end, an Amdahl VM computing engine, ACP farms, and (primarily) VMS workstations. This presentation will discuss the implementation of the Fermilab Central Computing Facility Architectural Model. Implications for Code Management in such a heterogeneous environment, including issues such as modularity and centrality, will be considered. Special emphasis will be placed on connectivity and communications between the front-end, LSSC, and workstations, as practiced at Fermilab. 2 figs.
Computing on Knights and Kepler Architectures
NASA Astrophysics Data System (ADS)
Bortolotti, G.; Caberletti, M.; Crimi, G.; Ferraro, A.; Giacomini, F.; Manzali, M.; Maron, G.; Pivanti, M.; Salomoni, D.; Schifano, S. F.; Tripiccione, R.; Zanella, M.
2014-06-01
A recent trend in scientific computing is the increasingly important role of co-processors, originally built to accelerate graphics rendering, and now used for general high-performance computing. The INFN Computing On Knights and Kepler Architectures (COKA) project focuses on assessing the suitability of co-processor boards for scientific computing in a wide range of physics applications, and on studying the best programming methodologies for these systems. Here we present in a comparative way our results in porting a Lattice Boltzmann code on two state-of-the-art accelerators: the NVIDIA K20X, and the Intel Xeon-Phi. We describe our implementations, analyze results and compare with a baseline architecture adopting Intel Sandy Bridge CPUs.
Quantum perceptron over a field and neural network architecture selection in a quantum computer.
da Silva, Adenilton José; Ludermir, Teresa Bernarda; de Oliveira, Wilson Rosa
2016-04-01
In this work, we propose a quantum neural network named quantum perceptron over a field (QPF). Quantum computers are not yet a reality and the models and algorithms proposed in this work cannot be simulated in actual (or classical) computers. QPF is a direct generalization of a classical perceptron and solves some drawbacks found in previous models of quantum perceptrons. We also present a learning algorithm named Superposition based Architecture Learning algorithm (SAL) that optimizes the neural network weights and architectures. SAL searches for the best architecture in a finite set of neural network architectures with linear time over the number of patterns in the training set. SAL is the first learning algorithm to determine neural network architectures in polynomial time. This speedup is obtained by the use of quantum parallelism and a non-linear quantum operator. PMID:26878722
Evaluation of Visual Computer Simulator for Computer Architecture Education
ERIC Educational Resources Information Center
Imai, Yoshiro; Imai, Masatoshi; Moritoh, Yoshio
2013-01-01
This paper presents trial evaluation of a visual computer simulator in 2009-2011, which has been developed to play some roles of both instruction facility and learning tool simultaneously. And it illustrates an example of Computer Architecture education for University students and usage of e-Learning tool for Assembly Programming in order to…
Highly parallel computer architecture for robotic computation
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)
1991-01-01
In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Wavelet Algorithms for Illumination Computations
NASA Astrophysics Data System (ADS)
Schroder, Peter
One of the core problems of computer graphics is the computation of the equilibrium distribution of light in a scene. This distribution is given as the solution to a Fredholm integral equation of the second kind involving an integral over all surfaces in the scene. In the general case such solutions can only be numerically approximated, and are generally costly to compute, due to the geometric complexity of typical computer graphics scenes. For this computation both Monte Carlo and finite element techniques (or hybrid approaches) are typically used. A simplified version of the illumination problem is known as radiosity, which assumes that all surfaces are diffuse reflectors. For this case hierarchical techniques, first introduced by Hanrahan et al. (32), have recently gained prominence. The hierarchical approaches lead to an asymptotic improvement when only finite precision is required. The resulting algorithms have cost proportional to O(k^2 + n) versus the usual O(n^2) (k is the number of input surfaces, n the number of finite elements into which the input surfaces are meshed). Similarly a hierarchical technique has been introduced for the more general radiance problem (which allows glossy reflectors) by Aupperle et al. (6). In this dissertation we show the equivalence of these hierarchical techniques to the use of a Haar wavelet basis in a general Galerkin framework. By so doing, we come to a deeper understanding of the properties of the numerical approximations used and are able to extend the hierarchical techniques to higher orders. In particular, we show the correspondence of the geometric arguments underlying hierarchical methods to the theory of Calderon-Zygmund operators and their sparse realization in wavelet bases. The resulting wavelet algorithms for radiosity and radiance are analyzed and numerical results achieved with our implementation are reported. We find that the resulting algorithms achieve smaller and smoother errors at equivalent work.
ATCA for Machines-- Advanced Telecommunications Computing Architecture
Larsen, R.S.; /SLAC
2008-04-22
The Advanced Telecommunications Computing Architecture is a new industry open standard for electronics instrument modules and shelves being evaluated for the International Linear Collider (ILC). It is the first industrial standard designed for High Availability (HA). ILC availability simulations have shown clearly that the capabilities of ATCA are needed in order to achieve acceptable integrated luminosity. The ATCA architecture looks attractive for beam instruments and detector applications as well. This paper provides an overview of ongoing R&D including application of HA principles to power electronics systems.
Implementing a computing architecture with WISDOM
Zebrowski, J.R.
1991-01-01
Over the past two years, the Savannah River Site (SRS) work force has expanded by more than 6000 employees. This large influx of personnel, in conjunction with the limited office space, has resulted in an overcrowding problem on site. To alleviate some of the overcrowding, Westinghouse Savannah River Company (WSRC) has been in the process of leasing space from several office buildings within Aiken, SC. Brookhaven, the latest off-site office building to be leased, is the starting point for a new direction in office automation which will eventually spread throughout SRS. The computing architecture in place at Brookhaven was designed to adhere to the SRS computer architecture guidelines as published by the WSRC Computer Architecture Standards Team (CAST). At the heart of the Brookhaven implementation is a Workstation Integration System for DOS, OS/2 and Macintosh (WISDOM). The key features of the WISDOM system include: it's utilization of a Local Area Network (LAN), it's Graphical User Interface (GUI), it's cross-platform capability, it's portable user interface, and the installation program. To begin, I will give an overview of the network architecture, then discuss WISDOM in detail, mention some platform integration problems that need to be addressed and conclude with a summary of the user benefits that WISDOM provides.
Implementing a computing architecture with WISDOM
Zebrowski, J.R.
1991-12-31
Over the past two years, the Savannah River Site (SRS) work force has expanded by more than 6000 employees. This large influx of personnel, in conjunction with the limited office space, has resulted in an overcrowding problem on site. To alleviate some of the overcrowding, Westinghouse Savannah River Company (WSRC) has been in the process of leasing space from several office buildings within Aiken, SC. Brookhaven, the latest off-site office building to be leased, is the starting point for a new direction in office automation which will eventually spread throughout SRS. The computing architecture in place at Brookhaven was designed to adhere to the SRS computer architecture guidelines as published by the WSRC Computer Architecture Standards Team (CAST). At the heart of the Brookhaven implementation is a Workstation Integration System for DOS, OS/2 and Macintosh (WISDOM). The key features of the WISDOM system include: it`s utilization of a Local Area Network (LAN), it`s Graphical User Interface (GUI), it`s cross-platform capability, it`s portable user interface, and the installation program. To begin, I will give an overview of the network architecture, then discuss WISDOM in detail, mention some platform integration problems that need to be addressed and conclude with a summary of the user benefits that WISDOM provides.
Computer graphics in architecture and engineering
NASA Technical Reports Server (NTRS)
Greenberg, D. P.
1975-01-01
The present status of the application of computer graphics to the building profession or architecture and its relationship to other scientific and technical areas were discussed. It was explained that, due to the fragmented nature of architecture and building activities (in contrast to the aerospace industry), a comprehensive, economic utilization of computer graphics in this area is not practical and its true potential cannot now be realized due to the present inability of architects and structural, mechanical, and site engineers to rely on a common data base. Future emphasis will therefore have to be placed on a vertical integration of the construction process and effective use of a three-dimensional data base, rather than on waiting for any technological breakthrough in interactive computing.
DFT algorithms for bit-serial GaAs array processor architectures
NASA Technical Reports Server (NTRS)
Mcmillan, Gary B.
1988-01-01
Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Analysis of an Atom-Optical Architecture for Quantum Computation
NASA Astrophysics Data System (ADS)
Devitt, Simon J.; Stephens, Ashley M.; Munro, William J.; Nemoto, Kae
Quantum technology based on photons has emerged as one of the most promising platforms for quantum information processing, having already been used in proof-of-principle demonstrations of quantum communication and quantum computation. However, the scalability of this technology depends on the successful integration of experimentally feasible devices in an architecture that tolerates realistic errors and imperfections. Here, we analyse an atom-optical architecture for quantum computation designed to meet the requirements of scalability. The architecture is based on a modular atom-cavity device that provides an effective photon-photon interaction, allowing for the rapid, deterministic preparation of a large class of entangled states. We begin our analysis at the physical level, where we outline the experimental cavity quantum electrodynamics requirements of the basic device. Then, we describe how a scalable network of these devices can be used to prepare a three-dimensional topological cluster state, sufficient for universal fault-tolerant quantum computation. We conclude at the application level, where we estimate the system-level requirements of the architecture executing an algorithm compiled for compatibility with the topological cluster state.
Towards optoelectronic architectures for integrated neuromorphic computers
NASA Astrophysics Data System (ADS)
Martinenghi, Romain; Baylon Fuentes, Antonio; Jacquot, Maxime; Chembo, Yanne K.; Larger, Laurent
2014-03-01
We investigate theoretically and experimentally the computational properties of an optoelectronic neuromorphic processor based on a complex nonlinear dynamics. This neuromorphic approach is based on a new paradigm of or reservoir computing, which is intrinsically different from the concept of Turing machines. It essentially consists in expanding the input information to be processed into a higher dimensional phase space, through the nonlinear transient response of a complex dynamics excited by the input information. The computed output is then extracted via a linear separation of the transient trajectory in the complex phase space, performed through a learning phase consisting of the resolution of a regression problem. We here investigate an architecture for photonic neuromorphic computing via these complex nonlinear dynamical transients. A versatile photonic nonlinear transient computer based on a multiple-delay is reported. Its hybrid analogue and digital architecture allows for an easy reconfiguration, and for direct implementation of in-line processing. Its computational efficiency in parameter space is also analyzed, and the computational performance of this system is successfully evaluated on a standard spoken digit recognition task. We then discuss the pathways that can lead to its effective integration.
NASA Astrophysics Data System (ADS)
Karthik, Victor U.; Sivasuthan, Sivamayam; Hoole, Samuel Ratnajeevan H.
2014-02-01
The computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal - a particular field configuration yielding the design performance in synthesis or to match exterior measurements in NDE. The geometry of the design or the postulated interior defect is then computed. Several optimization methods are available for this. The most efficient like conjugate gradients are very complex to program for the required derivative information. The least efficient zeroth order algorithms like the genetic algorithm take much computational time but little programming effort. This paper reports launching a Genetic Algorithm kernel on thousands of compute unified device architecture (CUDA) threads exploiting the NVIDIA graphics processing unit (GPU) architecture. The efficiency of parallelization, although below that on shared memory supercomputer architectures, is quite effective in cutting down solution time into the realm of the practicable. We carry this further into multi-physics electro-heat problems where the parameters of description are in the electrical problem and the object function in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is the most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
Roadmap to the SRS computing architecture
Johnson, A.
1994-07-05
This document outlines the major steps that must be taken by the Savannah River Site (SRS) to migrate the SRS information technology (IT) environment to the new architecture described in the Savannah River Site Computing Architecture. This document proposes an IT environment that is {open_quotes}...standards-based, data-driven, and workstation-oriented, with larger systems being utilized for the delivery of needed information to users in a client-server relationship.{close_quotes} Achieving this vision will require many substantial changes in the computing applications, systems, and supporting infrastructure at the site. This document consists of a set of roadmaps which provide explanations of the necessary changes for IT at the site and describes the milestones that must be completed to finish the migration.
Efficient Universal Computing Architectures for Decoding Neural Activity
Rapoport, Benjamin I.; Turicchia, Lorenzo; Wattanapanitch, Woradorn; Davidson, Thomas J.; Sarpeshkar, Rahul
2012-01-01
The ability to decode neural activity into meaningful control signals for prosthetic devices is critical to the development of clinically useful brain– machine interfaces (BMIs). Such systems require input from tens to hundreds of brain-implanted recording electrodes in order to deliver robust and accurate performance; in serving that primary function they should also minimize power dissipation in order to avoid damaging neural tissue; and they should transmit data wirelessly in order to minimize the risk of infection associated with chronic, transcutaneous implants. Electronic architectures for brain– machine interfaces must therefore minimize size and power consumption, while maximizing the ability to compress data to be transmitted over limited-bandwidth wireless channels. Here we present a system of extremely low computational complexity, designed for real-time decoding of neural signals, and suited for highly scalable implantable systems. Our programmable architecture is an explicit implementation of a universal computing machine emulating the dynamics of a network of integrate-and-fire neurons; it requires no arithmetic operations except for counting, and decodes neural signals using only computationally inexpensive logic operations. The simplicity of this architecture does not compromise its ability to compress raw neural data by factors greater than . We describe a set of decoding algorithms based on this computational architecture, one designed to operate within an implanted system, minimizing its power consumption and data transmission bandwidth; and a complementary set of algorithms for learning, programming the decoder, and postprocessing the decoded output, designed to operate in an external, nonimplanted unit. The implementation of the implantable portion is estimated to require fewer than 5000 operations per second. A proof-of-concept, 32-channel field-programmable gate array (FPGA) implementation of this portion is consequently energy efficient
NASA Technical Reports Server (NTRS)
Mielke, R.; Stoughton, J.; Som, S.; Obando, R.; Malekpour, M.; Mandala, B.
1990-01-01
A functional description of the ATAMM Multicomputer Operating System is presented. ATAMM (Algorithm to Architecture Mapping Model) is a marked graph model which describes the implementation of large grained, decomposed algorithms on data flow architectures. AMOS, the ATAMM Multicomputer Operating System, is an operating system which implements the ATAMM rules. A first generation version of AMOS which was developed for the Advanced Development Module (ADM) is described. A second generation version of AMOS being developed for the Generic VHSIC Spaceborne Computer (GVSC) is also presented.
NASA Technical Reports Server (NTRS)
Hsia, T. C.; Lu, G. Z.; Han, W. H.
1987-01-01
In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
Adaptive DNA Computing Algorithm by Using PCR and Restriction Enzyme
NASA Astrophysics Data System (ADS)
Kon, Yuji; Yabe, Kaoru; Rajaee, Nordiana; Ono, Osamu
In this paper, we introduce an adaptive DNA computing algorithm by using polymerase chain reaction (PCR) and restriction enzyme. The adaptive algorithm is designed based on Adleman-Lipton paradigm[3] of DNA computing. In this work, however, unlike the Adleman- Lipton architecture a cutting operation has been introduced to the algorithm and the mechanism in which the molecules used by computation were feedback to the next cycle devised. Moreover, the amplification by PCR is performed in the molecule used by feedback and the difference concentration arisen in the base sequence can be used again. By this operation the molecules which serve as a solution candidate can be reduced down and the optimal solution is carried out in the shortest path problem. The validity of the proposed adaptive algorithm is considered with the logical simulation and finally we go on to propose applying adaptive algorithm to the chemical experiment which used the actual DNA molecules for solving an optimal network problem.
Acoustooptic linear algebra processors - Architectures, algorithms, and applications
NASA Technical Reports Server (NTRS)
Casasent, D.
1984-01-01
Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
Accelerating scientific computations with mixed precision algorithms
NASA Astrophysics Data System (ADS)
Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack; Kurzak, Jakub; Langou, Julie; Langou, Julien; Luszczek, Piotr; Tomov, Stanimire
2009-12-01
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. Program summaryProgram title: ITER-REF Catalogue identifier: AECO_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECO_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 7211 No. of bytes in distributed program, including test data, etc.: 41 862 Distribution format: tar.gz Programming language: FORTRAN 77 Computer: desktop, server Operating system: Unix/Linux RAM: 512 Mbytes Classification: 4.8 External routines: BLAS (optional) Nature of problem: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. Solution method: Mixed precision algorithms stem from the observation that, in many cases, a single precision solution of a problem can be refined to the point where double precision accuracy is achieved. A common approach to the solution of linear systems, either dense or sparse, is to perform the LU
Lewis, P.S.
1988-10-01
Least squares techniques are widely used in adaptive signal processing. While algorithms based on least squares are robust and offer rapid convergence properties, they also tend to be complex and computationally intensive. To enable the use of least squares techniques in real-time applications, it is necessary to develop adaptive algorithms that are efficient and numerically stable, and can be readily implemented in hardware. The first part of this work presents a uniform development of general recursive least squares (RLS) algorithms, and multichannel least squares lattice (LSL) algorithms. RLS algorithms are developed for both direct estimators, in which a desired signal is present, and for mixed estimators, in which no desired signal is available, but the signal-to-data cross-correlation is known. In the second part of this work, new and more flexible techniques of mapping algorithms to array architectures are presented. These techniques, based on the synthesis and manipulation of locally recursive algorithms (LRAs), have evolved from existing data dependence graph-based approaches, but offer the increased flexibility needed to deal with the structural complexities of the RLS and LSL algorithms. Using these techniques, various array architectures are developed for each of the RLS and LSL algorithms and the associated space/time tradeoffs presented. In the final part of this work, the application of these algorithms is demonstrated by their employment in the enhancement of single-trial auditory evoked responses in magnetoencephalography. 118 refs., 49 figs., 36 tabs.
Parallel computer graphics algorithms for the Connection Machine
Richardson, J.F.
1990-01-01
Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The connection machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hearafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.
Algorithms and architectures for robot vision
NASA Technical Reports Server (NTRS)
Schenker, Paul S.
1990-01-01
The scope of the current work is to develop practical sensing implementations for robots operating in complex, partially unstructured environments. A focus in this work is to develop object models and estimation techniques which are specific to requirements of robot locomotion, approach and avoidance, and grasp and manipulation. Such problems have to date received limited attention in either computer or human vision - in essence, asking not only how perception is in general modeled, but also what is the functional purpose of its underlying representations. As in the past, researchers are drawing on ideas from both the psychological and machine vision literature. Of particular interest is the development 3-D shape and motion estimates for complex objects when given only partial and uncertain information and when such information is incrementally accrued over time. Current studies consider the use of surface motion, contour, and texture information, with the longer range goal of developing a fused sensing strategy based on these sources and others.
Performance evaluation of the SX-6 vector architecture forscientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan Carter; Shalf,John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri,Jahed; Van der Wijngaart, Rob
2005-01-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor, and compares it against the cache-based IBMPower3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX-6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache-based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Resource utilization model for the algorithm to architecture mapping model
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Patel, Rakesh R.
1993-01-01
The analytical model for resource utilization and the variable node time and conditional node model for the enhanced ATAMM model for a real-time data flow architecture are presented in this research. The Algorithm To Architecture Mapping Model, ATAMM, is a Petri net based graph theoretic model developed at Old Dominion University, and is capable of modeling the execution of large-grained algorithms on a real-time data flow architecture. Using the resource utilization model, the resource envelope may be obtained directly from a given graph and, consequently, the maximum number of required resources may be evaluated. The node timing diagram for one iteration period may be obtained using the analytical resource envelope. The variable node time model, which describes the change in resource requirement for the execution of an algorithm under node time variation, is useful to expand the applicability of the ATAMM model to heterogeneous architectures. The model also describes a method of detecting the presence of resource limited mode and its subsequent prevention. Graphs with conditional nodes are shown to be reduced to equivalent graphs with time varying nodes and, subsequently, may be analyzed using the variable node time model to determine resource requirements. Case studies are performed on three graphs for the illustration of applicability of the analytical theories.
Job Superscheduler Architecture and Performance in Computational Grid Environments
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Oliker, Leonid; Biswas, Rupak
2003-01-01
Computational grids hold great promise in utilizing geographically separated heterogeneous resources to solve large-scale complex scientific problems. However, a number of major technical hurdles, including distributed resource management and effective job scheduling, stand in the way of realizing these gains. In this paper, we propose a novel grid superscheduler architecture and three distributed job migration algorithms. We also model the critical interaction between the superscheduler and autonomous local schedulers. Extensive performance comparisons with ideal, central, and local schemes using real workloads from leading computational centers are conducted in a simulation environment. Additionally, synthetic workloads are used to perform a detailed sensitivity analysis of our superscheduler. Several key metrics demonstrate that substantial performance gains can be achieved via smart superscheduling in distributed computational grids.
Developing a Distributed Computing Architecture at Arizona State University.
ERIC Educational Resources Information Center
Armann, Neil; And Others
1994-01-01
Development of Arizona State University's computing architecture, designed to ensure that all new distributed computing pieces will work together, is described. Aspects discussed include the business rationale, the general architectural approach, characteristics and objectives of the architecture, specific services, and impact on the university…
Algorithms and architectures for high performance analysis of semantic graphs.
Hendrickson, Bruce Alan
2005-09-01
analysis. Since intelligence datasets can be extremely large, the focus of this work is on the use of parallel computers. We have been working to develop scalable parallel algorithms that will be at the core of a semantic graph analysis infrastructure. Our work has involved two different thrusts, corresponding to two different computer architectures. The first architecture of interest is distributed memory, message passing computers. These machines are ubiquitous and affordable, but they are challenging targets for graph algorithms. Much of our distributed-memory work to date has been collaborative with researchers at Lawrence Livermore National Laboratory and has focused on finding short paths on distributed memory parallel machines. Our implementation on 32K processors of BlueGene/Light finds shortest paths between two specified vertices in just over a second for random graphs with 4 billion vertices.
Parallel algorithms for mapping pipelined and parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
NASA Astrophysics Data System (ADS)
Hou, Zhen-Long; Wei, Xiao-Hui; Huang, Da-Nian; Sun, Xu
2015-09-01
We apply reweighted inversion focusing to full tensor gravity gradiometry data using message-passing interface (MPI) and compute unified device architecture (CUDA) parallel computing algorithms, and then combine MPI with CUDA to formulate a hybrid algorithm. Parallel computing performance metrics are introduced to analyze and compare the performance of the algorithms. We summarize the rules for the performance evaluation of parallel algorithms. We use model and real data from the Vinton salt dome to test the algorithms. We find good match between model and real density data, and verify the high efficiency and feasibility of parallel computing algorithms in the inversion of full tensor gravity gradiometry data.
Frances: A Tool for Understanding Computer Architecture and Assembly Language
ERIC Educational Resources Information Center
Sondag, Tyler; Pokorny, Kian L.; Rajan, Hridesh
2012-01-01
Students in all areas of computing require knowledge of the computing device including software implementation at the machine level. Several courses in computer science curricula address these low-level details such as computer architecture and assembly languages. For such courses, there are advantages to studying real architectures instead of…
Chae, S.H.
1988-01-01
This dissertation describes an intelligent rule-based iterative parallel algorithm for solving randomly distributed large sparse linear systems of equations, and also the efficient parallel processing computer architecture for the implementation of the algorithm. Implemented with the Jacobi iterative method, the intelligent rule-based algorithm reduces the parallel execution time by reducing the individual inner product operation time. A static dataflow architecture is proposed for implementing the intelligent rule-based iterative parallel algorithm. The dataflow computer architecture has the capability to support parallelism exploited in the algorithm, and the execute the algorithm asynchronously. The proposed computer architecture consists of a main processor, several control processors, scalar slave processors, and pipelined slave processors. Several control processors share with the main processor the heavy burden of allocation of operation packets and of sychronization for parallel processing.
Computer architecture evaluation for structural dynamics computations: Project summary
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1989-01-01
The intent of the proposed effort is the examination of the impact of the elements of parallel architectures on the performance realized in a parallel computation. To this end, three major projects are developed: a language for the expression of high level parallelism, a statistical technique for the synthesis of multicomputer interconnection networks based upon performance prediction, and a queueing model for the analysis of shared memory hierarchies.
Verifying a Computer Algorithm Mathematically.
ERIC Educational Resources Information Center
Olson, Alton T.
1986-01-01
Presents an example of mathematics from an algorithmic point of view, with emphasis on the design and verification of this algorithm. The program involves finding roots for algebraic equations using the half-interval search algorithm. The program listing is included. (JN)
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm
Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En
2015-01-01
A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction. PMID:26287193
Hierarchical Poly Tree computer architectures defined by computational multidisciplinary mechanics
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug; Johnson, Keith
1989-01-01
This paper will develop an alternative computer architecture called the Poly Tree. Based on the requirements of computational mechanics and the concept of hierarchical substructuring, the paper will explore the development of problem-dependent parallel networks of processors which will enable significant, often superlinear, speed enhancements; provide a logical/efficient framework for linear/nonlinear and transient structural mechanics problems; and provide a logical framework from which to apply model reduction procedures. In addition, the paper will explore optimal processor arrangements which define the overall system granularity. Consideration will also be given to system I/O requirements.
A high throughput architecture for a low complexity soft-output demapping algorithm
NASA Astrophysics Data System (ADS)
Ali, I.; Wasenmüller, U.; Wehn, N.
2015-11-01
Iterative channel decoders such as Turbo-Code and LDPC decoders show exceptional performance and therefore they are a part of many wireless communication receivers nowadays. These decoders require a soft input, i.e., the logarithmic likelihood ratio (LLR) of the received bits with a typical quantization of 4 to 6 bits. For computing the LLR values from a received complex symbol, a soft demapper is employed in the receiver. The implementation cost of traditional soft-output demapping methods is relatively large in high order modulation systems, and therefore low complexity demapping algorithms are indispensable in low power receivers. In the presence of multiple wireless communication standards where each standard defines multiple modulation schemes, there is a need to have an efficient demapper architecture covering all the flexibility requirements of these standards. Another challenge associated with hardware implementation of the demapper is to achieve a very high throughput in double iterative systems, for instance, MIMO and Code-Aided Synchronization. In this paper, we present a comprehensive communication and hardware performance evaluation of low complexity soft-output demapping algorithms to select the best algorithm for implementation. The main goal of this work is to design a high throughput, flexible, and area efficient architecture. We describe architectures to execute the investigated algorithms. We implement these architectures on a FPGA device to evaluate their hardware performance. The work has resulted in a hardware architecture based on the figured out best low complexity algorithm delivering a high throughput of 166 Msymbols/second for Gray mapped 16-QAM modulation on Virtex-5. This efficient architecture occupies only 127 slice registers, 248 slice LUTs and 2 DSP48Es.
QPSO-Based Adaptive DNA Computing Algorithm
Karakose, Mehmet; Cigdem, Ugur
2013-01-01
DNA (deoxyribonucleic acid) computing that is a new computation model based on DNA molecules for information storage has been increasingly used for optimization and data analysis in recent years. However, DNA computing algorithm has some limitations in terms of convergence speed, adaptability, and effectiveness. In this paper, a new approach for improvement of DNA computing is proposed. This new approach aims to perform DNA computing algorithm with adaptive parameters towards the desired goal using quantum-behaved particle swarm optimization (QPSO). Some contributions provided by the proposed QPSO based on adaptive DNA computing algorithm are as follows: (1) parameters of population size, crossover rate, maximum number of operations, enzyme and virus mutation rate, and fitness function of DNA computing algorithm are simultaneously tuned for adaptive process, (2) adaptive algorithm is performed using QPSO algorithm for goal-driven progress, faster operation, and flexibility in data, and (3) numerical realization of DNA computing algorithm with proposed approach is implemented in system identification. Two experiments with different systems were carried out to evaluate the performance of the proposed approach with comparative results. Experimental results obtained with Matlab and FPGA demonstrate ability to provide effective optimization, considerable convergence speed, and high accuracy according to DNA computing algorithm. PMID:23935409
A Biomimetic Adaptive Algorithm and Low-Power Architecture for Implantable Neural Decoders
Rapoport, Benjamin I.; Wattanapanitch, Woradorn; Penagos, Hector L.; Musallam, Sam; Andersen, Richard A.; Sarpeshkar, Rahul
2010-01-01
Algorithmically and energetically efficient computational architectures that operate in real time are essential for clinically useful neural prosthetic devices. Such devices decode raw neural data to obtain direct control signals for external devices. They can also perform data compression and vastly reduce the bandwidth and consequently power expended in wireless transmission of raw data from implantable brain-machine interfaces. We describe a biomimetic algorithm and micropower analog circuit architecture for decoding neural cell ensemble signals. The decoding algorithm implements a continuous-time artificial neural network, using a bank of adaptive linear filters with kernels that emulate synaptic dynamics. The filters transform neural signal inputs into control-parameter outputs, and can be tuned automatically in an on-line learning process. We provide experimental validation of our system using neural data from thalamic head-direction cells in an awake behaving rat. PMID:19964345
Algorithm architecture co-design for ultra low-power image sensor
NASA Astrophysics Data System (ADS)
Laforest, T.; Dupret, A.; Verdant, A.; Lattard, D.; Villard, P.
2012-03-01
In a context of embedded video surveillance, stand alone leftbehind image sensors are used to detect events with high level of confidence, but also with a very low power consumption. Using a steady camera, motion detection algorithms based on background estimation to find regions in movement are simple to implement and computationally efficient. To reduce power consumption, the background is estimated using a down sampled image formed of macropixels. In order to extend the class of moving objects to be detected, we propose an original mixed mode architecture developed thanks to an algorithm architecture co-design methodology. This programmable architecture is composed of a vector of SIMD processors. A basic RISC architecture was optimized in order to implement motion detection algorithms with a dedicated set of 42 instructions. Definition of delta modulation as a calculation primitive has allowed to implement algorithms in a very compact way. Thereby, a 1920x1080@25fps CMOS image sensor performing integrated motion detection is proposed with a power estimation of 1.8 mW.
A High Performance COTS Based Computer Architecture
NASA Astrophysics Data System (ADS)
Patte, Mathieu; Grimoldi, Raoul; Trautner, Roland
2014-08-01
Using Commercial Off The Shelf (COTS) electronic components for space applications is a long standing idea. Indeed the difference in processing performance and energy efficiency between radiation hardened components and COTS components is so important that COTS components are very attractive for use in mass and power constrained systems. However using COTS components in space is not straightforward as one must account with the effects of the space environment on the COTS components behavior. In the frame of the ESA funded activity called High Performance COTS Based Computer, Airbus Defense and Space and its subcontractor OHB CGS have developed and prototyped a versatile COTS based architecture for high performance processing. The rest of the paper is organized as follows: in a first section we will start by recapitulating the interests and constraints of using COTS components for space applications; then we will briefly describe existing fault mitigation architectures and present our solution for fault mitigation based on a component called the SmartIO; in the last part of the paper we will describe the prototyping activities executed during the HiP CBC project.
Algorithmic Mechanism Design of Evolutionary Computation
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
Algorithmic Mechanism Design of Evolutionary Computation.
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
A flexible algorithm for calculating pair interactions on SIMD architectures
NASA Astrophysics Data System (ADS)
Páll, Szilárd; Hess, Berk
2013-12-01
Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
NASA Technical Reports Server (NTRS)
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
From Point Clouds to Architectural Models: Algorithms for Shape Reconstruction
NASA Astrophysics Data System (ADS)
Canciani, M.; Falcolini, C.; Saccone, M.; Spadafora, G.
2013-02-01
The use of terrestrial laser scanners in architectural survey applications has become more and more common. Row data complexity, as given by scanner restitution, leads to several problems about design and 3D-modelling starting from Point Clouds. In this context we present a study on architectural sections and mathematical algorithms for their shape reconstruction, according to known or definite geometrical rules, focusing on shapes of different complexity. Each step of the semi-automatic algorithm has been developed using Mathematica software and CAD, integrating both programs in order to reconstruct a geometrical CAD model of the object. Our study is motivated by the fact that, for architectural survey, most of three dimensional modelling procedures concerning point clouds produce superabundant, but often unnecessary, information and are also very expensive in terms of cpu time using more and more sophisticated hardware and software. On the contrary, it's important to simplify/decimate the point cloud in order to recognize a particular form out of some definite geometric/architectonic shapes. Such a process consists of several steps: first the definition of plane sections and characterization of their architecture; secondly the construction of a continuous plane curve depending on some parameters. In the third step we allow the selection on the curve of some nodal points with given specific characteristics (symmetry, tangency conditions, shadowing exclusion, corners, … ). The fourth and last step is the construction of a best shape defined by the comparison with an abacus of known geometrical elements, such as moulding profiles, leading to a precise architectonical section. The algorithms have been developed and tested in very different situations and are presented in a case study of complex geometries such as some mouldings profiles in the Church of San Carlo alle Quattro Fontane.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Architecture and applications of the HEP multiprocessor computer system
Smith, B.J.
1981-01-01
The HEP computer system is a large scale scientific parallel computer employing shared-resource MIMD architecture. The hardware and software facilities provided by the system are described, and techniques found useful in programming the system are discussed. 3 references.
Mutual Algorithm-Architecture Analysis for Real - Parallel Systems in Particle Physics Experiments.
NASA Astrophysics Data System (ADS)
Ni, Ping
Data acquisition from particle colliders requires real-time detection of tracks and energy clusters from collision events occurring at intervals of tens of mus. Beginning with the specification of a benchmark track-finding algorithm, parallel implementations have been developed. A revision of the routing scheme for performing reductions such as a tree sum, called the reduced routing distance scheme, has been developed and analyzed. The scheme reduces inter-PE communication time for narrow communication channel systems. A new parallel algorithm, called the interleaved tree sum, for parallel reduction problems has been developed that increases efficiency of processor use. Detailed analysis of this algorithm with different routing schemes is presented. Comparable parallel algorithms are analyzed, also taking into account the architectural parameters that play an important role in this parallel algorithm analysis. Computation and communication times are analyzed to guide the design of a custom system based on a massively parallel processing component. Developing an optimal system requires mutual analysis of algorithm and architecture parameters. It is shown that matching a processor array size to the parallelism of the problem does not always produce the best system design. Based on promising benchmark simulation results, an application specific hardware prototype board, called Dasher, has been built using two Blitzen chips. The processing array is a mesh-connected SIMD system with 256 PEs. Its design is discussed, with details on the software environment.
NASA Technical Reports Server (NTRS)
Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra
1989-01-01
In part 1 architecture of NETRA is presented. A performance evaluation of NETRA using several common vision algorithms is also presented. Performance of algorithms when they are mapped on one cluster is described. It is shown that SIMD, MIMD, and systolic algorithms can be easily mapped onto processor clusters, and almost linear speedups are possible. For some algorithms, analytical performance results are compared with implementation performance results. It is observed that the analysis is very accurate. Performance analysis of parallel algorithms when mapped across clusters is presented. Mappings across clusters illustrate the importance and use of shared as well as distributed memory in achieving high performance. The parameters for evaluation are derived from the characteristics of the parallel algorithms, and these parameters are used to evaluate the alternative communication strategies in NETRA. Furthermore, the effect of communication interference from other processors in the system on the execution of an algorithm is studied. Using the analysis, performance of many algorithms with different characteristics is presented. It is observed that if communication speeds are matched with the computation speeds, good speedups are possible when algorithms are mapped across clusters.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical
Computing Algorithms for Nuffield Advanced Physics.
ERIC Educational Resources Information Center
Summers, M. K.
1978-01-01
Defines all recurrence relations used in the Nuffield course, to solve first- and second-order differential equations, and describes a typical algorithm for computer generation of solutions. (Author/GA)
Machine Computation; An Algorithmic Approach.
ERIC Educational Resources Information Center
Gonzalez, Richard F.; McMillan, Claude, Jr.
Designed for undergraduate social science students, this textbook concentrates on using the computer in a straightforward way to manipulate numbers and variables in order to solve problems. The text is problem oriented and assumes that the student has had little prior experience with either a computer or programing languages. An introduction to…
NASA Astrophysics Data System (ADS)
Lee, Seungbeom; Lee, Hanho
This paper presents a novel high-speed low-complexity pipelined degree-computationless modified Euclidean (pDCME) algorithm architecture for high-speed RS decoders. The pDCME algorithm allows elimination of the degree-computation so as to reduce hardware complexity and obtain high-speed processing. A high-speed RS decoder based on the pDCME algorithm has been designed and implemented with 0.13-μm CMOS standard cell technology in a supply voltage of 1.1V. The proposed RS decoder operates at a clock frequency of 660MHz and has a throughput of 5.3Gb/s. The proposed architecture requires approximately 15% fewer gate counts and a simpler control logic than architectures based on the popular modified Euclidean algorithm.
Modern hardware architectures accelerate porous media flow computations
NASA Astrophysics Data System (ADS)
Kulczewski, Michal; Kurowski, Krzysztof; Kierzynka, Michal; Dohnalik, Marek; Kaczmarczyk, Jan; Borujeni, Ali Takbiri
2012-05-01
Investigation of rock properties, porosity and permeability particularly, which determines transport media characteristic, is crucial to reservoir engineering. Nowadays, micro-tomography (micro-CT) methods allow to obtain vast of petro-physical properties. The micro-CT method facilitates visualization of pores structures and acquisition of total porosity factor, determined by sticking together 2D slices of scanned rock and applying proper absorption cut-off point. Proper segmentation of pores representation in 3D is important to solve the permeability of porous media. This factor is recently determined by the means of Computational Fluid Dynamics (CFD), a popular method to analyze problems related to fluid flows, taking advantage of numerical methods and constantly growing computing powers. The recent advent of novel multi-, many-core and graphics processing unit (GPU) hardware architectures allows scientists to benefit even more from parallel processing and built-in new features. The high level of parallel scalability offers both, the time-to-solution decrease and greater accuracy - top factors in reservoir engineering. This paper aims to present research results related to fluid flow simulations, particularly solving the total porosity and permeability of porous media, taking advantage of modern hardware architectures. In our approach total porosity is calculated by the means of general-purpose computing on multiple GPUs. This application sticks together 2D slices of scanned rock and by the means of a marching tetrahedra algorithm, creates a 3D representation of pores and calculates the total porosity. Experimental results are compared with data obtained via other popular methods, including Nuclear Magnetic Resonance (NMR), helium porosity and nitrogen permeability tests. Then CFD simulations are performed on a large-scale high performance hardware architecture to solve the flow and permeability of porous media. In our experiments we used Lattice Boltzmann
Algorithms for computing the multivariable stability margin
NASA Technical Reports Server (NTRS)
Tekawy, Jonathan A.; Safonov, Michael G.; Chiang, Richard Y.
1989-01-01
Stability margin for multiloop flight control systems has become a critical issue, especially in highly maneuverable aircraft designs where there are inherent strong cross-couplings between the various feedback control loops. To cope with this issue, we have developed computer algorithms based on non-differentiable optimization theory. These algorithms have been developed for computing the Multivariable Stability Margin (MSM). The MSM of a dynamical system is the size of the smallest structured perturbation in component dynamics that will destabilize the system. These algorithms have been coded and appear to be reliable. As illustrated by examples, they provide the basis for evaluating the robustness and performance of flight control systems.
Performance Analysis of Cloud Computing Architectures Using Discrete Event Simulation
NASA Technical Reports Server (NTRS)
Stocker, John C.; Golomb, Andrew M.
2011-01-01
Cloud computing offers the economic benefit of on-demand resource allocation to meet changing enterprise computing needs. However, the flexibility of cloud computing is disadvantaged when compared to traditional hosting in providing predictable application and service performance. Cloud computing relies on resource scheduling in a virtualized network-centric server environment, which makes static performance analysis infeasible. We developed a discrete event simulation model to evaluate the overall effectiveness of organizations in executing their workflow in traditional and cloud computing architectures. The two part model framework characterizes both the demand using a probability distribution for each type of service request as well as enterprise computing resource constraints. Our simulations provide quantitative analysis to design and provision computing architectures that maximize overall mission effectiveness. We share our analysis of key resource constraints in cloud computing architectures and findings on the appropriateness of cloud computing in various applications.
An events based algorithm for distributing concurrent tasks on multi-core architectures
NASA Astrophysics Data System (ADS)
Holmes, David W.; Williams, John R.; Tilke, Peter
2010-02-01
In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism including message passing and scatter-gather can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a "hungry" pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near linear speed-up with results for efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware.
Cryogenic Control Architecture for Large-Scale Quantum Computing
NASA Astrophysics Data System (ADS)
Hornibrook, J. M.; Colless, J. I.; Conway Lamb, I. D.; Pauka, S. J.; Lu, H.; Gossard, A. C.; Watson, J. D.; Gardner, G. C.; Fallahi, S.; Manfra, M. J.; Reilly, D. J.
2015-02-01
Solid-state qubits have recently advanced to the level that enables them, in principle, to be scaled up into fault-tolerant quantum computers. As these physical qubits continue to advance, meeting the challenge of realizing a quantum machine will also require the development of new supporting devices and control architectures with complexity far beyond the systems used in today's few-qubit experiments. Here, we report a microarchitecture for controlling and reading out qubits during the execution of a quantum algorithm such as an error-correcting code. We demonstrate the basic principles of this architecture using a cryogenic switch matrix implemented via high-electron-mobility transistors and a new kind of semiconductor device based on gate-switchable capacitance. The switch matrix is used to route microwave waveforms to qubits under the control of a field-programmable gate array, also operating at cryogenic temperatures. Taken together, these results suggest a viable approach for controlling large-scale quantum systems using semiconductor technology.
An improved distance matrix computation algorithm for multicore clusters.
Al-Neama, Mohammed W; Reda, Naglaa M; Ghaleb, Fayed F M
2014-01-01
Distance matrix has diverse usage in different research areas. Its computation is typically an essential task in most bioinformatics applications, especially in multiple sequence alignment. The gigantic explosion of biological sequence databases leads to an urgent need for accelerating these computations. DistVect algorithm was introduced in the paper of Al-Neama et al. (in press) to present a recent approach for vectorizing distance matrix computing. It showed an efficient performance in both sequential and parallel computing. However, the multicore cluster systems, which are available now, with their scalability and performance/cost ratio, meet the need for more powerful and efficient performance. This paper proposes DistVect1 as highly efficient parallel vectorized algorithm with high performance for computing distance matrix, addressed to multicore clusters. It reformulates DistVect1 vectorized algorithm in terms of clusters primitives. It deduces an efficient approach of partitioning and scheduling computations, convenient to this type of architecture. Implementations employ potential of both MPI and OpenMP libraries. Experimental results show that the proposed method performs improvement of around 3-fold speedup upon SSE2. Further it also achieves speedups more than 9 orders of magnitude compared to the publicly available parallel implementation utilized in ClustalW-MPI. PMID:25013779
A digital architecture for support vector machines: theory, algorithm, and FPGA implementation.
Anguita, D; Boni, A; Ridella, S
2003-01-01
In this paper, we propose a digital architecture for support vector machine (SVM) learning and discuss its implementation on a field programmable gate array (FPGA). We analyze briefly the quantization effects on the performance of the SVM in classification problems to show its robustness, in the feedforward phase, respect to fixed-point math implementations; then, we address the problem of SVM learning. The architecture described here makes use of a new algorithm for SVM learning which is less sensitive to quantization errors respect to the solution appeared so far in the literature. The algorithm is composed of two parts: the first one exploits a recurrent network for finding the parameters of the SVM; the second one uses a bisection process for computing the threshold. The architecture implementing the algorithm is described in detail and mapped on a real current-generation FPGA (Xilinx Virtex II). Its effectiveness is then tested on a channel equalization problem, where real-time performances are of paramount importance. PMID:18244555
Multithreaded processor architecture for parallel symbolic computation. Technical report
Fujita, T.
1987-09-01
This paper describes the Multilisp Architecture for Symbolic Applications (MASA), which is a multithreaded processor architecture for parallel symbolic computation with various features intended for effective Multilisp program execution. The principal mechanisms exploited for this processor are multiple contexts, interleaved pipeline execution from separate instruction streams, and synchronization based on a bit in each memory cell. The tagged architecture approach is taken for Lisp program execution, and trap conditions are provided for future object manipulation and garbage collection.
Pipeline and parallel architectures for computer communication systems
Reddi, A.V.
1983-01-01
Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.
Resource Efficient Hardware Architecture for Fast Computation of Running Max/Min Filters
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k2 − 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
Resource efficient hardware architecture for fast computation of running max/min filters.
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k(2) - 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
A random search algorithm for laboratory computers
NASA Technical Reports Server (NTRS)
Curry, R. E.
1975-01-01
The small laboratory computer is ideal for experimental control and data acquisition. Postexperimental data processing is often performed on large computers because of the availability of sophisticated programs, but costs and data compatibility are negative factors. Parameter optimization can be accomplished on the small computer, offering ease of programming, data compatibility, and low cost. A previously proposed random-search algorithm ('random creep') was found to be very slow in convergence. A method is proposed (the 'random leap' algorithm) which starts in a global search mode and automatically adjusts step size to speed convergence. A FORTRAN executive program for the random-leap algorithm is presented which calls a user-supplied function subroutine. An example of a function subroutine is given which calculates maximum-likelihood estimates of receiver operating-characteristic parameters from binary response data. Other applications in parameter estimation, generalized least squares, and matrix inversion are discussed.
A new hardware-efficient algorithm and reconfigurable architecture for image contrast enhancement.
Huang, Shih-Chia; Chen, Wen-Chieh
2014-10-01
Contrast enhancement is crucial when generating high quality images for image processing applications, such as digital image or video photography, liquid crystal display processing, and medical image analysis. In order to achieve real-time performance for high-definition video applications, it is necessary to design efficient contrast enhancement hardware architecture to meet the needs of real-time processing. In this paper, we propose a novel hardware-oriented contrast enhancement algorithm which can be implemented effectively for hardware design. In order to be considered for hardware implementation, approximation techniques are proposed to reduce these complex computations during performance of the contrast enhancement algorithm. The proposed hardware-oriented contrast enhancement algorithm achieves good image quality by measuring the results of qualitative and quantitative analyzes. To decrease hardware cost and improve hardware utilization for real-time performance, a reduction in circuit area is proposed through use of parameter-controlled reconfigurable architecture. The experiment results show that the proposed hardware-oriented contrast enhancement algorithm can provide an average frame rate of 48.23 frames/s at high definition resolution 1920 × 1080. PMID:25148665
Parallel algorithms for computing linked list prefix
Han, Y. )
1989-06-01
Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.
Integrated computer control system architectural overview
Van Arsdall, P.
1997-06-18
This overview introduces the NIF Integrated Control System (ICCS) architecture. The design is abstract to allow the construction of many similar applications from a common framework. This summary lays the essential foundation for understanding the model-based engineering approach used to execute the design.
Implementation of the FDK algorithm for cone-beam CT on the cell broadband engine architecture
NASA Astrophysics Data System (ADS)
Scherl, Holger; Koerner, Mario; Hofmann, Hannes; Eckert, Wieland; Kowarschik, Markus; Hornegger, Joachim
2007-03-01
In most of today's commercially available cone-beam CT scanners, the well known FDK method is used for solving the 3D reconstruction task. The computational complexity of this algorithm prohibits its use for many medical applications without hardware acceleration. The brand-new Cell Broadband Engine Architecture (CBEA) with its high level of parallelism is a cost-efficient processor for performing the FDK reconstruction according to the medical requirements. The programming scheme, however, is quite different to any standard personal computer hardware. In this paper, we present an innovative implementation of the most time-consuming parts of the FDK algorithm: filtering and back-projection. We also explain the required transformations to parallelize the algorithm for the CBEA. Our software framework allows to compute the filtering and back-projection in parallel, making it possible to do an on-the-fly-reconstruction. The achieved results demonstrate that a complete FDK reconstruction is computed with the CBEA in less than seven seconds for a standard clinical scenario. Given the fact that scan times are usually much higher, we conclude that reconstruction is finished right after the end of data acquisition. This enables us to present the reconstructed volume to the physician in real-time, immediately after the last projection image has been acquired by the scanning device.
A computational architecture for social agents
Bond, A.H.
1996-12-31
This article describes a new class of information-processing models for social agents. They axe derived from primate brain architecture, the processing in brain regions, the interactions among brain regions, and the social behavior of primates. In another paper, we have reviewed the neuroanatomical connections and functional involvements of cortical regions. We reviewed the evidence for a hierarchical architecture in the primate brain. By examining neuroanatomical evidence for connections among neural areas, we were able to establish anatomical regions and connections. We then examined evidence for specific functional involvements of the different neural axeas and found some support for hierarchical functioning, not only for the perception hierarchies but also for the planning and action hierarchy in the frontal lobes.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
A Simple Physical Optics Algorithm Perfect for Parallel Computing
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1993-01-01
One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Computational algorithms for simulations in atmospheric optics.
Konyaev, P A; Lukin, V P
2016-04-20
A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors. PMID:27140113
Middleware in Modern High Performance Computing System Architectures
Engelmann, Christian; Ong, Hong Hoe; Scott, Stephen L
2007-01-01
A recent trend in modern high performance computing (HPC) system architectures employs ''lean'' compute nodes running a lightweight operating system (OS). Certain parts of the OS as well as other system software services are moved to service nodes in order to increase performance and scalability. This paper examines the impact of this HPC system architecture trend on HPC ''middleware'' software solutions, which traditionally equip HPC systems with advanced features, such as parallel and distributed programming models, appropriate system resource management mechanisms, remote application steering and user interaction techniques. Since the approach of keeping the compute node software stack small and simple is orthogonal to the middleware concept of adding missing OS features between OS and application, the role and architecture of middleware in modern HPC systems needs to be revisited. The result is a paradigm shift in HPC middleware design, where single middleware services are moved to service nodes, while runtime environments (RTEs) continue to reside on compute nodes.
Computer Architecture. (Latest Citations from the Aerospace Database)
NASA Technical Reports Server (NTRS)
1996-01-01
The bibliography contains citations concerning research and development in the field of computer architecture. Design of computer systems, microcomputer components, and digital networks are among the topics discussed. Multimicroprocessor system performance, software development, and aerospace avionics applications are also included. (Contains 50-250 citations and includes a subject term index and title list.)
The Contribution of Visualization to Learning Computer Architecture
ERIC Educational Resources Information Center
Yehezkel, Cecile; Ben-Ari, Mordechai; Dreyfus, Tommy
2007-01-01
This paper describes a visualization environment and associated learning activities designed to improve learning of computer architecture. The environment, EasyCPU, displays a model of the components of a computer and the dynamic processes involved in program execution. We present the results of a research program that analysed the contribution of…
Architecture and applications of the HEP multiprocessor computer system
Smith, B.J.; Fink, D.J.
1982-01-01
The HEP computer system is a large scale scientific parallel computer employing shared resource MIMD architecture. The hardware and software facilities provided by the system are described, and techniques found to be useful in programming the system are also discussed. 3 references.
Fault tolerant hypercube computer system architecture
NASA Technical Reports Server (NTRS)
Madan, Herb S. (Inventor); Chow, Edward (Inventor)
1989-01-01
A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node
Algorithm-dependent fault tolerance for distributed computing
P. D. Hough; M. e. Goldsby; E. J. Walsh
2000-02-01
Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.
Architectures of Kepler Planet Systems with Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Morehead, Robert C.; Ford, Eric B.
2015-12-01
The distribution of period normalized transit duration ratios among Kepler’s multiple transiting planet systems constrains the distributions of mutual orbital inclinations and orbital eccentricities. However, degeneracies in these parameters tied to the underlying number of planets in these systems complicate their interpretation. To untangle the true architecture of planet systems, the mutual inclination, eccentricity, and underlying planet number distributions must be considered simultaneously. The complexities of target selection, transit probability, detection biases, vetting, and follow-up observations make it impractical to write an explicit likelihood function. Approximate Bayesian computation (ABC) offers an intriguing path forward. In its simplest form, ABC generates a sample of trial population parameters from a prior distribution to produce synthetic datasets via a physically-motivated forward model. Samples are then accepted or rejected based on how close they come to reproducing the actual observed dataset to some tolerance. The accepted samples form a robust and useful approximation of the true posterior distribution of the underlying population parameters. We build on the considerable progress from the field of statistics to develop sequential algorithms for performing ABC in an efficient and flexible manner. We demonstrate the utility of ABC in exoplanet populations and present new constraints on the distributions of mutual orbital inclinations, eccentricities, and the relative number of short-period planets per star. We conclude with a discussion of the implications for other planet occurrence rate calculations, such as eta-Earth.
Architecture independent environment for developing engineering software on MIMD computers
NASA Technical Reports Server (NTRS)
Valimohamed, Karim A.; Lopez, L. A.
1990-01-01
Engineers are constantly faced with solving problems of increasing complexity and detail. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. First, the issues that must be considered to develop those tools are examined. The two main areas of concern were architecture independence and data management. Architecture independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the investment of time and effort that goes into developing the software. The management of data is a crucial aspect of solving large engineering problems. It must be considered in light of the new hardware organizations that are available. Second, the functional design and implementation of a software environment that facilitates developing architecture independent software for large engineering applications are described. The topics of discussion include: a description of the model that supports the development of architecture independent software; identifying and exploiting concurrency within the application program; data coherence; engineering data base and memory management.
Thrifty: An Exascale Architecture for Energy Proportional Computing
Torrellas, Josep
2014-12-23
The objective of this project is to design key aspects of an exascale architecture called Thrifty that addresses the challenges of power/energy efficiency, resiliency, and performance in exascale systems. The project includes work on computer architecture (Josep Torrellas from University of Illinois), compilation (Daniel Quinlan from Lawrence Livermore National Laboratory), runtime and applications (Laura Carrington from University of California San Diego), and circuits (Wilfred Pinfold from Intel Corporation).
Heavy Lift Vehicle (HLV) Avionics Flight Computing Architecture Study
NASA Technical Reports Server (NTRS)
Hodson, Robert F.; Chen, Yuan; Morgan, Dwayne R.; Butler, A. Marc; Sdhuh, Joseph M.; Petelle, Jennifer K.; Gwaltney, David A.; Coe, Lisa D.; Koelbl, Terry G.; Nguyen, Hai D.
2011-01-01
A NASA multi-Center study team was assembled from LaRC, MSFC, KSC, JSC and WFF to examine potential flight computing architectures for a Heavy Lift Vehicle (HLV) to better understand avionics drivers. The study examined Design Reference Missions (DRMs) and vehicle requirements that could impact the vehicles avionics. The study considered multiple self-checking and voting architectural variants and examined reliability, fault-tolerance, mass, power, and redundancy management impacts. Furthermore, a goal of the study was to develop the skills and tools needed to rapidly assess additional architectures should requirements or assumptions change.
Iterative algorithms for large sparse linear systems on parallel computers
NASA Technical Reports Server (NTRS)
Adams, L. M.
1982-01-01
Algorithms for assembling in parallel the sparse system of linear equations that result from finite difference or finite element discretizations of elliptic partial differential equations, such as those that arise in structural engineering are developed. Parallel linear stationary iterative algorithms and parallel preconditioned conjugate gradient algorithms are developed for solving these systems. In addition, a model for comparing parallel algorithms on array architectures is developed and results of this model for the algorithms are given.
Aerodynamic optimization studies on advanced architecture computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana
1995-01-01
The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises of choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accomodate multi-discipline optimization in future computations. In the work , however, only single discipline aerodynamic optimization will be included.
Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures
Datta, Kaushik; Murphy, Mark; Volkov, Vasily; Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Patterson, David; Shalf, John; Yelick, Katherine
2008-08-22
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations -- a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.
Fast computation algorithms for speckle pattern simulation
Nascov, Victor; Samoilă, Cornel; Ursuţiu, Doru
2013-11-13
We present our development of a series of efficient computation algorithms, generally usable to calculate light diffraction and particularly for speckle pattern simulation. We use mainly the scalar diffraction theory in the form of Rayleigh-Sommerfeld diffraction formula and its Fresnel approximation. Our algorithms are based on a special form of the convolution theorem and the Fast Fourier Transform. They are able to evaluate the diffraction formula much faster than by direct computation and we have circumvented the restrictions regarding the relative sizes of the input and output domains, met on commonly used procedures. Moreover, the input and output planes can be tilted each to other and the output domain can be off-axis shifted.
Algorithms for parallel flow solvers on message passing architectures
NASA Technical Reports Server (NTRS)
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
LINCS: Livermore's network architecture. [Octopus computing network
Fletcher, J.G.
1982-01-01
Octopus, a local computing network that has been evolving at the Lawrence Livermore National Laboratory for over fifteen years, is currently undergoing a major revision. The primary purpose of the revision is to consolidate and redefine the variety of conventions and formats, which have grown up over the years, into a single standard family of protocols, the Livermore Interactive Network Communication Standard (LINCS). This standard treats the entire network as a single distributed operating system such that access to a computing resource is obtained in a single way, whether that resource is local (on the same computer as the accessing process) or remote (on another computer). LINCS encompasses not only communication but also such issues as the relationship of customer to server processes and the structure, naming, and protection of resources. The discussion includes: an overview of the Livermore user community and computing hardware, the functions and structure of each of the seven layers of LINCS protocol, the reasons why we have designed our own protocols and why we are dissatisfied by the directions that current protocol standards are taking.
Computational architecture for integrated controls and structures design
NASA Technical Reports Server (NTRS)
Belvin, W. Keith; Park, K. C.
1989-01-01
To facilitate the development of control structure interaction (CSI) design methodology, a computational architecture for interdisciplinary design of active structures is presented. The emphasis of the computational procedure is to exploit existing sparse matrix structural analysis techniques, in-core data transfer with control synthesis programs, and versatility in the optimization methodology to avoid unnecessary structural or control calculations. The architecture is designed such that all required structure, control and optimization analyses are performed within one program. Hence, the optimization strategy is not unduly constrained by cold starts of existing structural analysis and control synthesis packages.
Layered Architectures for Quantum Computers and Quantum Repeaters
NASA Astrophysics Data System (ADS)
Jones, Nathan C.
This chapter examines how to organize quantum computers and repeaters using a systematic framework known as layered architecture, where machine control is organized in layers associated with specialized tasks. The framework is flexible and could be used for analysis and comparison of quantum information systems. To demonstrate the design principles in practice, we develop architectures for quantum computers and quantum repeaters based on optically controlled quantum dots, showing how a myriad of technologies must operate synchronously to achieve fault-tolerance. Optical control makes information processing in this system very fast, scalable to large problem sizes, and extendable to quantum communication.
Panel on future directions in parallel computer architecture
VanTilborg, A.M. )
1989-06-01
One of the program highlights of the 15th Annual International Symposium on Computer Architecture, held May 30 - June 2, 1988 in Honolulu, was a panel session on future directions in parallel computer architecture. The panel was organized and chaired by the author, and was comprised of Prof. Jack Dennis (NASA Ames Research Institute for Advanced Computer Science), Prof. H.T. Kung (Carnegie Mellon), and Dr. Burton Smith (Tera Computer Company). The objective of the panel was to identify the likely trajectory of future parallel computer system progress, particularly from the sandpoint of marketplace acceptance. Approximately 250 attendees participated in the session, in which each panelist began with a ten minute viewgraph explanation of his views, followed by an open and sometimes lively exchange with the audience and fellow panelists. The session ran for ninety minutes.
Architecture and grid application of cluster computing system
NASA Astrophysics Data System (ADS)
Lv, Yi; Yu, Shuiqin; Mao, Youju
2004-11-01
Recently, people pay more attention to the grid technology. It can not only connect all kinds of resources in the network, but also put them into a super transparent computing environment for customers to realize mete-computing which can share computing resources. Traditional parallel computing system, such as SMP(Symmetrical multiprocessor) and MPP(massively parallel processor), use multi-processors to raise computing speed in a close coupling way, so the flexible and scalable performance of the system are limited, as a result of it, the system can't meet the requirement of the grid technology. In this paper, the architecture of cluster computing system applied in grid nodes is introduced. It mainly includes the following aspects. First, the network architecture of cluster computing system in grid nodes is analyzed and designed. Second, how to realize distributing computing (including coordinating computing and sharing computing) of cluster computing system in grid nodes to construct virtual node computers is discussed. Last, communication among grid nodes is analyzed. In other words, it discusses how to realize single reflection to let all the service requirements from customers be met through sending to the grid nodes.
Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable
Schuller, Ivan K.; Stevens, Rick; Pino, Robinson; Pechan, Michael
2015-10-29
Computation in its many forms is the engine that fuels our modern civilization. Modern computation—based on the von Neumann architecture—has allowed, until now, the development of continuous improvements, as predicted by Moore’s law. However, computation using current architectures and materials will inevitably—within the next 10 years—reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like (“neuromorphic”) computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS based technology? If so, what are the basic research challenges for materials sicence and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully “neuromorphic” computer. To address this challenge, the following issues were considered: The main differences between neuromorphic and conventional computing as related to: signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning New neuromorphic architectures needed to: produce lower energy consumption, potential novel nanostructured materials, and enhanced computation Device and materials properties needed to implement functions such as: hysteresis, stability, and fault tolerance Comparisons of different implementations: spin torque, memristors, resistive switching, phase change, and optical schemes for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.
NASA Astrophysics Data System (ADS)
Trigueros-Espinosa, Blas; Vélez-Reyes, Miguel; Santiago-Santiago, Nayda G.; Rosario-Torres, Samuel
2011-06-01
Hyperspectral sensors can collect hundreds of images taken at different narrow and contiguously spaced spectral bands. This high-resolution spectral information can be used to identify materials and objects within the field of view of the sensor by their spectral signature, but this process may be computationally intensive due to the large data sizes generated by the hyperspectral sensors, typically hundreds of megabytes. This can be an important limitation for some applications where the detection process must be performed in real time (surveillance, explosive detection, etc.). In this work, we developed a parallel implementation of three state-ofthe- art target detection algorithms (RX algorithm, matched filter and adaptive matched subspace detector) using a graphics processing unit (GPU) based on the NVIDIA® CUDA™ architecture. In addition, a multi-core CPUbased implementation of each algorithm was developed to be used as a baseline for the speedups estimation. We evaluated the performance of the GPU-based implementations using an NVIDIA ® Tesla® C1060 GPU card, and the detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We achieved a maximum speedup in the GPU implementations of around 20x over a multicore CPU-based implementation, which suggests that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.
A single user efficiency measure for evaluation of parallel or pipeline computer architectures
NASA Technical Reports Server (NTRS)
Jones, W. P.
1978-01-01
A precise statement of the relationship between sequential computation at one rate, parallel or pipeline computation at a much higher rate, the data movement rate between levels of memory, the fraction of inherently sequential operations or data that must be processed sequentially, the fraction of data to be moved that cannot be overlapped with computation, and the relative computational complexity of the algorithms for the two processes, scalar and vector, was developed. The relationship should be applied to the multirate processes that obtain in the employment of various new or proposed computer architectures for computational aerodynamics. The relationship, an efficiency measure that the single user of the computer system perceives, argues strongly in favor of separating scalar and vector processes, sometimes referred to as loosely coupled processes, to achieve optimum use of hardware.
Parallel algorithms for optical digital computers
Huang, A.
1983-01-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.
A computationally efficient particle-simulation method suited to vector-computer architectures
McDonald, J.D.
1990-01-01
Recent interest in a National Aero-Space Plane (NASP) and various Aero-assisted Space Transfer Vehicles (ASTVs) presents the need for a greater understanding of high-speed rarefied flight conditions. Particle simulation techniques such as the Direct Simulation Monte Carlo (DSMC) method are well suited to such problems, but the high cost of computation limits the application of the methods to two-dimensional or very simple three-dimensional problems. This research re-examines the algorithmic structure of existing particle simulation methods and re-structures them to allow efficient implementation on vector-oriented supercomputers. A brief overview of the DSMC method and the Cray-2 vector computer architecture are provided, and the elements of the DSMC method that inhibit substantial vectorization are identified. One such element is the collision selection algorithm. A complete reformulation of underlying kinetic theory shows that this may be efficiently vectorized for general gas mixtures. The mechanics of collisions are vectorizable in the DSMC method, but several optimizations are suggested that greatly enhance performance. Also this thesis proposes a new mechanism for the exchange of energy between vibration and other energy modes. The developed scheme makes use of quantized vibrational states and is used in place of the Borgnakke-Larsen model. Finally, a simplified representation of physical space and boundary conditions is utilized to further reduce the computational cost of the developed method. Comparison to solutions obtained from the DSMC method for the relaxation of internal energy modes in a homogeneous gas, as well as single and multiple specie shock wave profiles, are presented. Additionally, a large scale simulation of the flow about the proposed Aeroassisted Flight Experiment (AFE) vehicle is included as an example of the new computational capability of the developed particle simulation method.
Fast search algorithms for computational protein design.
Traoré, Seydou; Roberts, Kyle E; Allouche, David; Donald, Bruce R; André, Isabelle; Schiex, Thomas; Barbe, Sophie
2016-05-01
One of the main challenges in computational protein design (CPD) is the huge size of the protein sequence and conformational space that has to be computationally explored. Recently, we showed that state-of-the-art combinatorial optimization technologies based on Cost Function Network (CFN) processing allow speeding up provable rigid backbone protein design methods by several orders of magnitudes. Building up on this, we improved and injected CFN technology into the well-established CPD package Osprey to allow all Osprey CPD algorithms to benefit from associated speedups. Because Osprey fundamentally relies on the ability of A* to produce conformations in increasing order of energy, we defined new A* strategies combining CFN lower bounds, with new side-chain positioning-based branching scheme. Beyond the speedups obtained in the new A*-CFN combination, this novel branching scheme enables a much faster enumeration of suboptimal sequences, far beyond what is reachable without it. Together with the immediate and important speedups provided by CFN technology, these developments directly benefit to all the algorithms that previously relied on the DEE/ A* combination inside Osprey* and make it possible to solve larger CPD problems with provable algorithms. PMID:26833706
Parallel algorithms for computational geometry utilizing a fixed number of processors
Strader, R.G.
1988-01-01
The design of algorithms for systems where both communication and computation are important is presented. Approaches to parallel computation and the underlying theoretical models are surveyed. two models of computation are developed, both based on a divide-and-conquer strategy. The first utilizes a tree-like merge resulting in several levels of communication and computation, the total number determined by the number of processors. The second model contains a fixed number of levels independent of the number of processors. Using the notation from the survey and the models of computation, algorithms are designed for the computational geometry problems of finding the convex hull and Delaunay triangulation for a set of uniform random points in the Euclidean plane. Communication and computation timing measurements based on these algorithms are presented and analyzed. The results are then generalized to predict the behavior of expanded problems. Architectural support, partitioning issues, and limitations of this approach are discussed.
Pipelined CPU Design with FPGA in Teaching Computer Architecture
ERIC Educational Resources Information Center
Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon
2012-01-01
This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…
In-Memory Computing Architectures for Sparse Distributed Memory.
Kang, Mingu; Shanbhag, Naresh R
2016-08-01
This paper presents an energy-efficient and high-throughput architecture for Sparse Distributed Memory (SDM)-a computational model of the human brain [1]. The proposed SDM architecture is based on the recently proposed in-memory computing kernel for machine learning applications called Compute Memory (CM) [2], [3]. CM achieves energy and throughput efficiencies by deeply embedding computation into the memory array. SDM-specific techniques such as hierarchical binary decision (HBD) are employed to reduce the delay and energy further. The CM-based SDM (CM-SDM) is a mixed-signal circuit, and hence circuit-aware behavioral, energy, and delay models in a 65 nm CMOS process are developed in order to predict system performance of SDM architectures in the auto- and hetero-associative modes. The delay and energy models indicate that CM-SDM, in general, can achieve up to 25 × and 12 × delay and energy reduction, respectively, over conventional SDM. When classifying 16 × 16 binary images with high noise levels (input bad pixel ratios: 15%-25%) into nine classes, all SDM architectures are able to generate output bad pixel ratios (Bo) ≤ 2%. The CM-SDM exhibits negligible loss in accuracy, i.e., its Bo degradation is within 0.4% as compared to that of the conventional SDM. PMID:27305686
Computational algorithms to predict Gene Ontology annotations
2015-01-01
Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper
VLSI architectures for computing multiplications and inverses in GF(2m).
Wang, C C; Truong, T K; Shao, H M; Deutsch, L J; Omura, J K; Reed, I S
1985-08-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore, naturally suitable for VLSI implementation. PMID:11539660
VLSI architectures for computing multiplications and inverses in GF(2-m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.; Reed, I. S.
1983-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
VLSI architectures for computing multiplications and inverses in GF(2m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.
1985-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
Efficient Algorithm and Architecture of Critical-Band Transform for Low-Power Speech Applications
NASA Astrophysics Data System (ADS)
Wang, Chao; Gan, Woon-Seng
2007-12-01
An efficient algorithm and its corresponding VLSI architecture for the critical-band transform (CBT) are developed to approximate the critical-band filtering of the human ear. The CBT consists of a constant-bandwidth transform in the lower frequency range and a Brown constant-[InlineEquation not available: see fulltext.] transform (CQT) in the higher frequency range. The corresponding VLSI architecture is proposed to achieve significant power efficiency by reducing the computational complexity, using pipeline and parallel processing, and applying the supply voltage scaling technique. A 21-band Bark scale CBT processor with a sampling rate of 16 kHz is designed and simulated. Simulation results verify its suitability for performing short-time spectral analysis on speech. It has a better fitting on the human ear critical-band analysis, significantly fewer computations, and therefore is more energy-efficient than other methods. With a 0.35[InlineEquation not available: see fulltext.]m CMOS technology, it calculates a 160-point speech in 4.99 milliseconds at 234 kHz. The power dissipation is 15.6[InlineEquation not available: see fulltext.]W at 1.1 V. It achieves 82.1[InlineEquation not available: see fulltext.] power reduction as compared to a benchmark 256-point FFT processor.
Quantum computations: algorithms and error correction
NASA Astrophysics Data System (ADS)
Kitaev, A. Yu
1997-12-01
Contents §0. Introduction §1. Abelian problem on the stabilizer §2. Classical models of computations2.1. Boolean schemes and sequences of operations2.2. Reversible computations §3. Quantum formalism3.1. Basic notions and notation3.2. Transformations of mixed states3.3. Accuracy §4. Quantum models of computations4.1. Definitions and basic properties4.2. Construction of various operators from the elements of a basis4.3. Generalized quantum control and universal schemes §5. Measurement operators §6. Polynomial quantum algorithm for the stabilizer problem §7. Computations with perturbations: the choice of a model §8. Quantum codes (definitions and general properties)8.1. Basic notions and ideas8.2. One-to-one codes8.3. Many-to-one codes §9. Symplectic (additive) codes9.1. Algebraic preparation9.2. The basic construction9.3. Error correction procedure9.4. Torus codes §10. Error correction in the computation process: general principles10.1. Definitions and results10.2. Proofs §11. Error correction: concrete procedures11.1. The symplecto-classical case11.2. The case of a complete basis Bibliography
Block sparse Cholesky algorithms on advanced uniprocessor computers
Ng, E.G.; Peyton, B.W.
1991-12-01
As with many other linear algebra algorithms, devising a portable implementation of sparse Cholesky factorization that performs well on the broad range of computer architectures currently available is a formidable challenge. Even after limiting our attention to machines with only one processor, as we have done in this report, there are still several interesting issues to consider. For dense matrices, it is well known that block factorization algorithms are the best means of achieving this goal. We take this approach for sparse factorization as well. This paper has two primary goals. First, we examine two sparse Cholesky factorization algorithms, the multifrontal method and a blocked left-looking sparse Cholesky method, in a systematic and consistent fashion, both to illustrate the strengths of the blocking techniques in general and to obtain a fair evaluation of the two approaches. Second, we assess the impact of various implementation techniques on time and storage efficiency, paying particularly close attention to the work-storage requirement of the two methods and their variants.
Algorithmic cooling and scalable NMR quantum computers
Boykin, P. Oscar; Mor, Tal; Roychowdhury, Vwani; Vatan, Farrokh; Vrijen, Rutger
2002-01-01
We present here algorithmic cooling (via polarization heat bath)—a powerful method for obtaining a large number of highly polarized spins in liquid nuclear-spin systems at finite temperature. Given that spin-half states represent (quantum) bits, algorithmic cooling cleans dirty bits beyond the Shannon's bound on data compression, by using a set of rapidly thermal-relaxing bits. Such auxiliary bits could be implemented by using spins that rapidly get into thermal equilibrium with the environment, e.g., electron spins. Interestingly, the interaction with the environment, usually a most undesired interaction, is used here to our benefit, allowing a cooling mechanism. Cooling spins to a very low temperature without cooling the environment could lead to a breakthrough in NMR experiments, and our “spin-refrigerating” method suggests that this is possible. The scaling of NMR ensemble computers is currently one of the main obstacles to building larger-scale quantum computing devices, and our spin-refrigerating method suggests that this problem can be resolved. PMID:11904402
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Woodruff, S.B.
1992-05-01
The Transient Reactor Analysis Code (TRAC), which features a two- fluid treatment of thermal-hydraulics, is designed to model transients in water reactors and related facilities. One of the major computational costs associated with TRAC and similar codes is calculating constitutive coefficients. Although the formulations for these coefficients are local the costs are flow-regime- or data-dependent; i.e., the computations needed for a given spatial node often vary widely as a function of time. Consequently, poor load balancing will degrade efficiency on either vector or data parallel architectures when the data are organized according to spatial location. Unfortunately, a general automatic solution to the load-balancing problem associated with data-dependent computations is not yet available for massively parallel architectures. This document discusses why developers algorithms, such as a neural net representation, that do not exhibit algorithms, such as a neural net representation, that do not exhibit load-balancing problems.
Portable computer system architecture for the Space Station Freedom program
NASA Technical Reports Server (NTRS)
Alena, Richard; Liu, Yuan-Kwei; Fernquist, Alan R.
1993-01-01
This paper outlines various mission requirements and technical approaches that support the potential use of portable computers in several defined activities within the Space Station Freedom (SSF) program. Specifically, the use of portable computers as consoles for both spacecraft control and payload applications is presented. Various issues and proposed solutions regarding the incorporation of portable computers within the program are presented. The primary issues presented regard architecture (standard interface for expansion, advanced processors and displays), integration (methods of high-speed data communication, peripheral interfaces, and interconnectivity within various support networks), and evolution (wireless communications and multimedia data interface methods).
Reviews of computing technology: A review of compound document architectures
Hudson, B.J.
1991-10-01
This review of computing technology will define, describe, and give examples of various approaches to document management through the use of compound document architectures. Experts agree that only 10% of business information exists in machine readable form, but much of what is stored is not in useful form. As a result, the average business document is copied over a dozen times during its life and duplicate copies are stored in numerous locations. The goal of compound document architectures is to provide an information support environment where rapid access to the correct information in the proper format is simplified. A compound document architecture provides structure to seemingly unstructured electronic documents, and standardizes the methods for interchange and access of entire or partial documents by authors and users.
Architecture Framework for Trapped-Ion Quantum Computer based on Performance Simulation Tool
NASA Astrophysics Data System (ADS)
Ahsan, Muhammad
The challenge of building scalable quantum computer lies in striking appropriate balance between designing a reliable system architecture from large number of faulty computational resources and improving the physical quality of system components. The detailed investigation of performance variation with physics of the components and the system architecture requires adequate performance simulation tool. In this thesis we demonstrate a software tool capable of (1) mapping and scheduling the quantum circuit on a realistic quantum hardware architecture with physical resource constraints, (2) evaluating the performance metrics such as the execution time and the success probability of the algorithm execution, and (3) analyzing the constituents of these metrics and visualizing resource utilization to identify system components which crucially define the overall performance. Using this versatile tool, we explore vast design space for modular quantum computer architecture based on trapped ions. We find that while success probability is uniformly determined by the fidelity of physical quantum operation, the execution time is a function of system resources invested at various layers of design hierarchy. At physical level, the number of lasers performing quantum gates, impact the latency of the fault-tolerant circuit blocks execution. When these blocks are used to construct meaningful arithmetic circuit such as quantum adders, the number of ancilla qubits for complicated non-clifford gates and entanglement resources to establish long-distance communication channels, become major performance limiting factors. Next, in order to factorize large integers, these adders are assembled into modular exponentiation circuit comprising bulk of Shor's algorithm. At this stage, the overall scaling of resource-constraint performance with the size of problem, describes the effectiveness of chosen design. By matching the resource investment with the pace of advancement in hardware technology
A pipelined IC architecture for radon transform computations in a multiprocessor array
Agi, I.; Hurst, P.J.; Current, K.W. . Dept. of Electrical Engineering and Computer Science)
1990-05-25
The amount of data generated by CT scanners is enormous, making the reconstruction operation slow, especially for 3-D and limited-data scans requiring iterative algorithms. The Radon transform and its inverse, commonly used for CT image reconstruction from projections, are computationally burdensome for today's single-processor computer architectures. If the processing times for the forward and inverse Radon transforms were comparatively small, a large set of new CT algorithms would become feasible, especially those for 3-D and iterative tomographic image reconstructions. In addition to image reconstruction, a fast Radon Transform Computer'' could be naturally applied in other areas of multidimensional signal processing including 2-D power spectrum estimation, modeling of human perception, Hough transforms, image representation, synthetic aperture radar processing, and others. A high speed processor for this operation is likely to motivate new algorithms for general multidimensional signal processing using the Radon transform. In the proposed workshop paper, we will first describe interpolation schemes useful in computation of the discrete Radon transform and backprojection and compare their errors and hardware complexities. We then will evaluate through statistical means the fixed-point number system required to accept and generate 12-bit input and output data with acceptable error using the linear interpolation scheme selected. These results set some of the requirements that must be met by our new VLSI chip architecture. Finally we will present a new unified architecture for a single-chip processor for computing both the forward Radon transform and backprojection at high data rates. 3 refs., 2 figs.
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
QNIX: A Linear Optical Architecture for Quantum Computing
NASA Astrophysics Data System (ADS)
Gimeno-Segovia, Mercedes; Shadbolt, Peter J.; Rudolph, Terry G.; Browne, Dan E.; Mendoza, Gabriel; Russell, Nicholas J.; Silverstone, Joshua W.; Santamato, Alberto; Carolan, Jacques; O'Brien, Jeremy
2015-03-01
There is currently a great deal of effort to develop a large-scale quantum computer, and one of the most promising platforms to do so is integrated linear optics. We present a proposal for a dynamical scheme for an integrated linear optics implementation of a one-way quantum computer. We go beyond the purely theoretical work and address practical issues in order to create a physically realistic design. We describe every step of cluster state construction and processing, showing the outstanding issues left to be addressed and our contributions to the different stages of the dynamical process. These include optimised interferometers for the generation of GHZ states, a universal and scalable architecture which requires entangled sources of no more than 3 photons with no active feed-forward, and loss-tolerant and fault-tolerant strategies specifically tailored to our proposed architecture. Our work demonstrates that building a linear optical quantum computer need be less challenging than previously thought, and brings large-scale switch-free linear optical architectures for quantum computing much closer to experimental realisation.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
NASA Astrophysics Data System (ADS)
da Silva, C. R.; da Silveira, P.; Wentzcovitch, R. M.; Pierce, M.; Erlebacher, G.
2007-12-01
We present an overview of the VLab, a system developed to handle execution of extensive workflows generated by first principles computations of thermoelastic properties of minerals. The multiplicity (102-3) of tasks derives from sampling of parameter space with variables such as pressure, temperature, strain, composition, etc. We review the algorithms of physical importance that define the system's requirements, its underlying service oriented architecture (SOA), and metadata. The system architecture emerges naturally. The SOA is a collection of web-services providing access to distributed computing nodes, controlling workflow execution, monitoring services, and providing data analyses tools, visualization services, data bases, and authentication services. A usage view diagram is described. We also show snapshots taken from the actual operational procedure in VLab. Research supported by NSF/ITR (VLab)
VLab: A Service Oriented Architecture for Distributed First Principles Materials Computations
NASA Astrophysics Data System (ADS)
da Silva, Cesar; da Silveira, Pedro; Wentzcovitch, Renata; Pierce, Marlon; Erlebacher, Gordon
2008-03-01
We present an overview of VLab, a system developed to handle execution of extensive workflows generated by first principles computations of thermoelastic properties of minerals. The multiplicity (10^2-3) of tasks derives from sampling of parameter space with variables such as pressure, temperature, strain, composition, etc. We review the algorithms of physical importance that define the system's requirements, its underlying service oriented architecture (SOA), and metadata. The system architecture emerges naturally. The SOA is a collection of web-services providing access to distributed computing nodes, workflow control, and monitoring services, and providing data analysis tools, visualization services, data bases, and authentication services. A usage view diagram is described. We also show snapshots taken from the actual operational procedure in VLab.
Nanotube devices based crossbar architecture: toward neuromorphic computing.
Zhao, W S; Agnus, G; Derycke, V; Filoramo, A; Bourgoin, J-P; Gamrat, C
2010-04-30
Nanoscale devices such as carbon nanotube and nanowires based transistors, memristors and molecular devices are expected to play an important role in the development of new computing architectures. While their size represents a decisive advantage in terms of integration density, it also raises the critical question of how to efficiently address large numbers of densely integrated nanodevices without the need for complex multi-layer interconnection topologies similar to those used in CMOS technology. Two-terminal programmable devices in crossbar geometry seem particularly attractive, but suffer from severe addressing difficulties due to cross-talk, which implies complex programming procedures. Three-terminal devices can be easily addressed individually, but with limited gain in terms of interconnect integration. We show how optically gated carbon nanotube devices enable efficient individual addressing when arranged in a crossbar geometry with shared gate electrodes. This topology is particularly well suited for parallel programming or learning in the context of neuromorphic computing architectures. PMID:20368686
Multi-level Hierarchical Poly Tree computer architectures
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug
1990-01-01
Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
Modified Golomb Algorithm for Computing Unit Fraction Expansions
ERIC Educational Resources Information Center
Man, Yiu-Kwong
2004-01-01
In this note, a modified Golomb algorithm for computing unit fraction expansions is presented. This algorithm has the advantage that the maximal denominators involved in the expansions will not exceed those computed by the original algorithm. In fact, the differences between the maximal denominators or the number of terms obtained by these two…
Thermodynamic cost of computation, algorithmic complexity and the information metric
NASA Technical Reports Server (NTRS)
Zurek, W. H.
1989-01-01
Algorithmic complexity is discussed as a computational counterpart to the second law of thermodynamics. It is shown that algorithmic complexity, which is a measure of randomness, sets limits on the thermodynamic cost of computations and casts a new light on the limitations of Maxwell's demon. Algorithmic complexity can also be used to define distance between binary strings.
Integration of nanoscale memristor synapses in neuromorphic computing architectures
NASA Astrophysics Data System (ADS)
Indiveri, Giacomo; Linares-Barranco, Bernabé; Legenstein, Robert; Deligeorgis, George; Prodromakis, Themistoklis
2013-09-01
Conventional neuro-computing architectures and artificial neural networks have often been developed with no or loose connections to neuroscience. As a consequence, they have largely ignored key features of biological neural processing systems, such as their extremely low-power consumption features or their ability to carry out robust and efficient computation using massively parallel arrays of limited precision, highly variable, and unreliable components. Recent developments in nano-technologies are making available extremely compact and low power, but also variable and unreliable solid-state devices that can potentially extend the offerings of availing CMOS technologies. In particular, memristors are regarded as a promising solution for modeling key features of biological synapses due to their nanoscale dimensions, their capacity to store multiple bits of information per element and the low energy required to write distinct states. In this paper, we first review the neuro- and neuromorphic computing approaches that can best exploit the properties of memristor and scale devices, and then propose a novel hybrid memristor-CMOS neuromorphic circuit which represents a radical departure from conventional neuro-computing approaches, as it uses memristors to directly emulate the biophysics and temporal dynamics of real synapses. We point out the differences between the use of memristors in conventional neuro-computing architectures and the hybrid memristor-CMOS circuit proposed, and argue how this circuit represents an ideal building block for implementing brain-inspired probabilistic computing paradigms that are robust to variability and fault tolerant by design.
Domain decomposition algorithms and computation fluid dynamics
NASA Technical Reports Server (NTRS)
Chan, Tony F.
1988-01-01
In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.
Hardware architecture for full analytical Fraunhofer computer-generated holograms
NASA Astrophysics Data System (ADS)
Pang, Zhi-Yong; Xu, Zong-Xi; Xiong, Yi; Chen, Biao; Dai, Hui-Min; Jiang, Shao-Ji; Dong, Jian-Wen
2015-09-01
Hardware architecture of parallel computation is proposed for generating Fraunhofer computer-generated holograms (CGHs). A pipeline-based integrated circuit architecture is realized by employing the modified Fraunhofer analytical formulism, which is large scale and enables all components to be concurrently operated. The architecture of the CGH contains five modules to calculate initial parameters of amplitude, amplitude compensation, phases, and phase compensation, respectively. The precalculator of amplitude is fully adopted considering the "reusable design" concept. Each complex operation type (such as square arithmetic) is reused only once by means of a multichannel selector. The implemented hardware calculates an 800×600 pixels hologram in parallel using 39,319 logic elements, 21,074 registers, and 12,651 memory bits in an Altera field-programmable gate array environment with stable operation at 50 MHz. Experimental results demonstrate that the quality of the images reconstructed from the hardware-generated hologram can be comparable to that of a software implementation. Moreover, the calculation speed is approximately 100 times faster than that of a personal computer with an Intel i5-3230M 2.6 GHz CPU for a triangular object.
Algorithmic support for commodity-based parallel computing systems.
Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
The Snowcloud System: Architecture and Algorithms for Snow Hydrology Studies
NASA Astrophysics Data System (ADS)
Skalka, C.; Brown, I.; Frolik, J.
2013-12-01
Snowcloud is an embedded data collection system for snow hydrology field research campaigns conducted in harsh climates and remote areas. The system combines distributed wireless sensor network technology and computational techniques to provide data at lower cost and higher spatio-temporal resolution than ground-based systems using traditional methods. Snowcloud has seen multiple Winter deployments in settings ranging from high desert to arctic, resulting in over a dozen node-years of practical experience. The Snowcloud system architecture consists of multiple TinyOS mesh-networked sensor stations collecting environmental data above and, in some deployments, below the snowpack. Monitored data modalities include snow depth, ground and air temperature, PAR and leaf-area index (LAI), and soil moisture. To enable power cycling and control of multiple sensors a custom power and sensor conditioning board was developed. The electronics and structural systems for individual stations have been designed and tested (in the lab and in situ) for ease of assembly and robustness to harsh winter conditions. Battery systems and solar chargers enable seasonal operation even under low/no light arctic conditions. Station costs range between 500 and 1000 depending on the instrumentation suite. For remote field locations, a custom designed hand-held device and data retrieval protocol serves as the primary data collection method. We are also developing and testing a Gateway device that will report data in near-real-time (NRT) over a cellular connection. Data is made available to users via web interfaces that also provide basic data analysis and visualization tools. For applications to snow hydrology studies, the better spatiotemporal resolution of snowpack data provided by Snowcloud is beneficial in several aspects. It provides insight into snowpack evolution, and allows us to investigate differences across different spatial and temporal scales in deployment areas. It enables the
Architectural requirements for the Red Storm computing system.
Camp, William J.; Tomkins, James Lee
2003-10-01
This report is based on the Statement of Work (SOW) describing the various requirements for delivering 3 new supercomputer system to Sandia National Laboratories (Sandia) as part of the Department of Energy's (DOE) Accelerated Strategic Computing Initiative (ASCI) program. This system is named Red Storm and will be a distributed memory, massively parallel processor (MPP) machine built primarily out of commodity parts. The requirements presented here distill extensive architectural and design experience accumulated over a decade and a half of research, development and production operation of similar machines at Sandia. Red Storm will have an unusually high bandwidth, low latency interconnect, specially designed hardware and software reliability features, a light weight kernel compute node operating system and the ability to rapidly switch major sections of the machine between classified and unclassified computing environments. Particular attention has been paid to architectural balance in the design of Red Storm, and it is therefore expected to achieve an atypically high fraction of its peak speed of 41 TeraOPS on real scientific computing applications. In addition, Red Storm is designed to be upgradeable to many times this initial peak capability while still retaining appropriate balance in key design dimensions. Installation of the Red Storm computer system at Sandia's New Mexico site is planned for 2004, and it is expected that the system will be operated for a minimum of five years following installation.
Scaling to Nanotechnology Limits with the PIMS Computer Architecture and a new Scaling Rule.
Debenedictis, Erik
2015-02-01
We describe a new approach to computing that moves towards the limits of nanotechnology using a newly formulated sc aling rule. This is in contrast to the current computer industry scali ng away from von Neumann's original computer at the rate of Moore's Law. We extend Moore's Law to 3D, which l eads generally to architectures that integrate logic and memory. To keep pow er dissipation cons tant through a 2D surface of the 3D structure requires using adiabatic principles. We call our newly proposed architecture Processor In Memory and Storage (PIMS). We propose a new computational model that integrates processing and memory into "tiles" that comprise logic, memory/storage, and communications functions. Since the programming model will be relatively stable as a system scales, programs repr esented by tiles could be executed in a PIMS system built with today's technology or could become the "schematic diagram" for implementation in an ultimate 3D nanotechnology of the future. We build a systems software approach that offers advantages over and above the technological and arch itectural advantages. Firs t, the algorithms may be more efficient in the conventional sens e of having fewer steps. Second, the algorithms may run with higher power efficiency per operation by being a better match for the adiabatic scaling ru le. The performance analysis based on demonstrated ideas in physical science suggests 80,000 x improvement in cost per operation for the (arguably) gene ral purpose function of emulating neurons in Deep Learning.
A Component Architecture for High-Performance Scientific Computing
Bernholdt, D E; Allan, B A; Armstrong, R; Bertrand, F; Chiu, K; Dahlgren, T L; Damevski, K; Elwasif, W R; Epperly, T W; Govindaraju, M; Katz, D S; Kohl, J A; Krishnan, M; Kumfert, G; Larson, J W; Lefantzi, S; Lewis, M J; Malony, A D; McInnes, L C; Nieplocha, J; Norris, B; Parker, S G; Ray, J; Shende, S; Windus, T L; Zhou, S
2004-12-14
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
A Component Architecture for High-Performance Scientific Computing
Bernholdt, David E; Allan, Benjamin A; Armstrong, Robert C; Bertrand, Felipe; Chiu, Kenneth; Dahlgren, Tamara L; Damevski, Kostadin; Elwasif, Wael R; Epperly, Thomas G; Govindaraju, Madhusudhan; Katz, Daniel S; Kohl, James A; Krishnan, Manoj Kumar; Kumfert, Gary K; Larson, J Walter; Lefantzi, Sophia; Lewis, Michael J; Malony, Allen D; McInnes, Lois C; Nieplocha, Jarek; Norris, Boyana; Parker, Steven G; Ray, Jaideep; Shende, Sameer; Windus, Theresa L; Zhou, Shujia
2006-07-03
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
HiMAT onboard flight computer system architecture and qualification
NASA Technical Reports Server (NTRS)
Myers, A. F.; Earls, M. R.; Callizo, L. A.
1981-01-01
Two highly maneuverable aircraft technology (HiMAT) remotely piloted research vehicles (RPRV's) are being flight tested at NASA Dryden Flight Research Center, Edwards, California, to demonstrate and evaluate a number of technological advances applicable to future fighter aircraft. Closed-loop primary flight control is performed from a ground-based cockpit utilizing a digital computer and up/down telemetry links. A backup flight control system for emergency operation resides in one of two onboard computers. Other functions of the onboard computer system are uplink processing, downlink processing, engine control, failure detection, and redundancy management. This paper describes the architecture, functions, and flight qualification of the HiMAT onboard flight computer systems.
Reveal, A General Reverse Engineering Algorithm for Inference of Genetic Network Architectures
NASA Technical Reports Server (NTRS)
Liang, Shoudan; Fuhrman, Stefanie; Somogyi, Roland
1998-01-01
Given the immanent gene expression mapping covering whole genomes during development, health and disease, we seek computational methods to maximize functional inference from such large data sets. Is it possible, in principle, to completely infer a complex regulatory network architecture from input/output patterns of its variables? We investigated this possibility using binary models of genetic networks. Trajectories, or state transition tables of Boolean nets, resemble time series of gene expression. By systematically analyzing the mutual information between input states and output states, one is able to infer the sets of input elements controlling each element or gene in the network. This process is unequivocal and exact for complete state transition tables. We implemented this REVerse Engineering ALgorithm (REVEAL) in a C program, and found the problem to be tractable within the conditions tested so far. For n = 50 (elements) and k = 3 (inputs per element), the analysis of incomplete state transition tables (100 state transition pairs out of a possible 10(exp 15)) reliably produced the original rule and wiring sets. While this study is limited to synchronous Boolean networks, the algorithm is generalizable to include multi-state models, essentially allowing direct application to realistic biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields.
Distributed sequence alignment applications for the public computing architecture.
Pellicer, S; Chen, G; Chan, K C C; Pan, Y
2008-03-01
The public computer architecture shows promise as a platform for solving fundamental problems in bioinformatics such as global gene sequence alignment and data mining with tools such as the basic local alignment search tool (BLAST). Our implementation of these two problems on the Berkeley open infrastructure for network computing (BOINC) platform demonstrates a runtime reduction factor of 1.15 for sequence alignment and 16.76 for BLAST. While the runtime reduction factor of the global gene sequence alignment application is modest, this value is based on a theoretical sequential runtime extrapolated from the calculation of a smaller problem. Because this runtime is extrapolated from running the calculation in memory, the theoretical sequential runtime would require 37.3 GB of memory on a single system. With this in mind, the BOINC implementation not only offers the reduced runtime, but also the aggregation of the available memory of all participant nodes. If an actual sequential run of the problem were compared, a more drastic reduction in the runtime would be seen due to an additional secondary storage I/O overhead for a practical system. Despite the limitations of the public computer architecture, most notably in communication overhead, it represents a practical platform for grid- and cluster-scale bioinformatics computations today and shows great potential for future implementations. PMID:18334454
The computational structural mechanics testbed architecture. Volume 2: The interface
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the third set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 3 describes the CLIP-Processor interface and related topics. It is intended only for processor developers.
Integrated command, control, communications and computation system functional architecture
NASA Technical Reports Server (NTRS)
Cooley, C. G.; Gilbert, L. E.
1981-01-01
The functional architecture for an integrated command, control, communications, and computation system applicable to the command and control portion of the NASA End-to-End Data. System is described including the downlink data processing and analysis functions required to support the uplink processes. The functional architecture is composed of four elements: (1) the functional hierarchy which provides the decomposition and allocation of the command and control functions to the system elements; (2) the key system features which summarize the major system capabilities; (3) the operational activity threads which illustrate the interrelationahip between the system elements; and (4) the interfaces which illustrate those elements that originate or generate data and those elements that use the data. The interfaces also provide a description of the data and the data utilization and access techniques.
The computational structural mechanics testbed architecture. Volume 2: Directives
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the second of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 2 describes the CLIP directives in detail. It is intended for intermediate and advanced users.
The computational structural mechanics testbed architecture. Volume 1: The language
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the first set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP, and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 1 presents the basic elements of the CLAMP language and is intended for all users.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
NASA Technical Reports Server (NTRS)
Rutishauser, David K.
2006-01-01
The motivation for this work comes from an observation that amidst the push for Massively Parallel (MP) solutions to high-end computing problems such as numerical physical simulations, large amounts of legacy code exist that are highly optimized for vector supercomputers. Because re-hosting legacy code often requires a complete re-write of the original code, which can be a very long and expensive effort, this work examines the potential to exploit reconfigurable computing machines in place of a vector supercomputer to implement an essentially unmodified legacy source code. Custom and reconfigurable computing resources could be used to emulate an original application's target platform to the extent required to achieve high performance. To arrive at an architecture that delivers the desired performance subject to limited resources involves solving a multi-variable optimization problem with constraints. Prior research in the area of reconfigurable computing has demonstrated that designing an optimum hardware implementation of a given application under hardware resource constraints is an NP-complete problem. The premise of the approach is that the general issue of applying reconfigurable computing resources to the implementation of an application, maximizing the performance of the computation subject to physical resource constraints, can be made a tractable problem by assuming a computational paradigm, such as vector processing. This research contributes a formulation of the problem and a methodology to design a reconfigurable vector processing implementation of a given application that satisfies a performance metric. A generic, parametric, architectural framework for vector processing implemented in reconfigurable logic is developed as a target for a scheduling/mapping algorithm that maps an input computation to a given instance of the architecture. This algorithm is integrated with an optimization framework to arrive at a specification of the architecture parameters
Architecture for High Speed Learning of Neural Network using Genetic Algorithm
NASA Astrophysics Data System (ADS)
Yoshikawa, Masaya; Terai, Hidekazu
This paper discusses the architecture for high speed learning of Neural Network (NN) using Genetic Algorithm (GA). The proposed architecture prevents local minimum by using the GA characteristic of holding several individual populations for a population-based search and achieves high speed processing adopting dedicated hardware. To keep general purpose equal software processing, the proposed architecture can be flexible genetic operations on GA and is introduced both Sigmoid function and Heaviside function on NN. Furthermore, the proposed architecture is not optimized only the pipeline at evaluation phase on NN, but also optimized hierarchic pipelines on the whole at evolutionary phase. We have done the simulation, verification and logic synthesis using library of 0.35μm CMOS standard cell. Simulation results evaluating the proposed architecture show to achieve 22 times speed on average compared with software processing.
Evaluation of leading scalar and vector architectures for scientific computations
Simon, Horst D.; Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Ethier, Stephane; Shalf, John
2004-04-20
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This project examines the performance of the cacheless vector Earth Simulator (ES) and compares it to superscalar cache-based IBM Power3 system. Results demonstrate that the ES is significantly faster than the Power3 architecture, highlighting the tremendous potential advantage of the ES for numerical simulation. However, vectorization of a particle-in-cell application (GTC) greatly increased the memory footprint preventing loop-level parallelism and limiting scalability potential.
The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures
Spafford, Kyle L; Meredith, Jeremy S; Lee, Seyong; Li, Dong; Roth, Philip C; Vetter, Jeffrey S
2012-01-01
With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Sorting on STAR. [CDC computer algorithm timing comparison
NASA Technical Reports Server (NTRS)
Stone, H. S.
1978-01-01
Timing comparisons are given for three sorting algorithms written for the CDC STAR computer. One algorithm is Hoare's (1962) Quicksort, which is the fastest or nearly the fastest sorting algorithm for most computers. A second algorithm is a vector version of Quicksort that takes advantage of the STAR's vector operations. The third algorithm is an adaptation of Batcher's (1968) sorting algorithm, which makes especially good use of vector operations but has a complexity of N(log N)-squared as compared with a complexity of N log N for the Quicksort algorithms. In spite of its worse complexity, Batcher's sorting algorithm is competitive with the serial version of Quicksort for vectors up to the largest that can be treated by STAR. Vector Quicksort outperforms the other two algorithms and is generally preferred. These results indicate that unusual instruction sets can introduce biases in program execution time that counter results predicted by worst-case asymptotic complexity analysis.
Biomorphic Multi-Agent Architecture for Persistent Computing
NASA Technical Reports Server (NTRS)
Lodding, Kenneth N.; Brewster, Paul
2009-01-01
A multi-agent software/hardware architecture, inspired by the multicellular nature of living organisms, has been proposed as the basis of design of a robust, reliable, persistent computing system. Just as a multicellular organism can adapt to changing environmental conditions and can survive despite the failure of individual cells, a multi-agent computing system, as envisioned, could adapt to changing hardware, software, and environmental conditions. In particular, the computing system could continue to function (perhaps at a reduced but still reasonable level of performance) if one or more component( s) of the system were to fail. One of the defining characteristics of a multicellular organism is unity of purpose. In biology, the purpose is survival of the organism. The purpose of the proposed multi-agent architecture is to provide a persistent computing environment in harsh conditions in which repair is difficult or impossible. A multi-agent, organism-like computing system would be a single entity built from agents or cells. Each agent or cell would be a discrete hardware processing unit that would include a data processor with local memory, an internal clock, and a suite of communication equipment capable of both local line-of-sight communications and global broadcast communications. Some cells, denoted specialist cells, could contain such additional hardware as sensors and emitters. Each cell would be independent in the sense that there would be no global clock, no global (shared) memory, no pre-assigned cell identifiers, no pre-defined network topology, and no centralized brain or control structure. Like each cell in a living organism, each agent or cell of the computing system would contain a full description of the system encoded as genes, but in this case, the genes would be components of a software genome.
A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures
NASA Astrophysics Data System (ADS)
Lopes, Raul H. C.; Reid, Ivan D.; Hobson, Peter R.
2014-06-01
Variations of k-d trees represent a fundamental data structure used in Computational Geometry with numerous applications in science. For example particle track fitting in the software of the LHC experiments, and in simulations of N-body systems in the study of dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hutt in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the computation time upper bounds to O(n log n) and even O(n) from O(n2). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations for the algorithm that can take advantage of multi-core architectures.
A Component Architecture for High-Performance Computing
Bernholdt, D E; Elwasif, W R; Kohl, J A; Epperly, T G W
2003-01-21
The Common Component Architecture (CCA) provides a means for developers to manage the complexity of large-scale scientific software systems and to move toward a ''plug and play'' environment for high-performance computing. The CCA model allows for a direct connection between components within the same process to maintain performance on inter-component calls. It is neutral with respect to parallelism, allowing components to use whatever means they desire to communicate within their parallel ''cohort.'' We will discuss in detail the importance of performance in the design of the CCA and will analyze the performance costs associated with features of the CCA.
Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; Van der Wijngaart, Rob
2003-05-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly out performs the cache-based architectures. However, certain applications are not easily amenable to vectorization and would re quire extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator
ERIC Educational Resources Information Center
Patti, D.; Spadaccini, A.; Palesi, M.; Fazzino, F.; Catania, V.
2012-01-01
The topics of computer architecture are always taught using an Assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students in learning the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This…
NASA Technical Reports Server (NTRS)
Rickard, D. A.; Bodenheimer, R. E.
1976-01-01
Digital computer components which perform two dimensional array logic operations (Tse logic) on binary data arrays are described. The properties of Golay transforms which make them useful in image processing are reviewed, and several architectures for Golay transform processors are presented with emphasis on the skeletonizing algorithm. Conventional logic control units developed for the Golay transform processors are described. One is a unique microprogrammable control unit that uses a microprocessor to control the Tse computer. The remaining control units are based on programmable logic arrays. Performance criteria are established and utilized to compare the various Golay transform machines developed. A critique of Tse logic is presented, and recommendations for additional research are included.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)
NASA Astrophysics Data System (ADS)
Vincenti, Henri
2016-03-01
The advent of exascale computers will enable 3D simulations of a new laser-plasma interaction regimes that were previously out of reach of current Petasale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potentialities of these new computing architectures. Indeed, achieving Exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy consuming part of an algorithm future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute nodes (''fat nodes'') that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPU's also have a reduced clock speed per core and can process Multiple Instructions on Multiple Datas (MIMD). At the software level Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to both achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the Pseudo-sepctral quasi-cylindrical code FBPIC on GPUs using the Numba python compiler.
Fast algorithms for computing isogenies between elliptic curves
NASA Astrophysics Data System (ADS)
Bostan, A.; Morain, F.; Salvy, B.; Schost, E.
2008-09-01
We survey algorithms for computing isogenies between elliptic curves defined over a field of characteristic either 0 or a large prime. We introduce a new algorithm that computes an isogeny of degree ell ( ell different from the characteristic) in time quasi-linear with respect to ell E This is based in particular on fast algorithms for power series expansion of the Weierstrass wp -function and related functions.
A Systolic Array-Based FPGA Parallel Architecture for the BLAST Algorithm
Guo, Xinyu; Wang, Hong; Devabhaktuni, Vijay
2012-01-01
A design of systolic array-based Field Programmable Gate Array (FPGA) parallel architecture for Basic Local Alignment Search Tool (BLAST) Algorithm is proposed. BLAST is a heuristic biological sequence alignment algorithm which has been used by bioinformatics experts. In contrast to other designs that detect at most one hit in one-clock-cycle, our design applies a Multiple Hits Detection Module which is a pipelining systolic array to search multiple hits in a single-clock-cycle. Further, we designed a Hits Combination Block which combines overlapping hits from systolic array into one hit. These implementations completed the first and second step of BLAST architecture and achieved significant speedup comparing with previously published architectures. PMID:25969747
Architecture-Aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures
Dongarra, Jack
2013-03-14
There is a widening gap between the peak performance of high performance computers and the performance realized by full applications. Over the next decade, extreme-scale systems will present major new challenges to software development that could widen the gap so much that it prevents the productive use of future DOE Leadership computers.
High-speed computation of the EM algorithm for PET image reconstruction
Rajan, K.; Patnaik, L.M.; Ramakrishna, J. )
1994-10-01
The PET image reconstruction based on the EM algorithm has several attractive advantages over the conventional convolution backprojection algorithms. However, two major drawbacks have impeded the routine use of the EM algorithm, namely, the long computational time due to slow convergence and the large memory required for the storage of the image, projection data and the probability matrix. In this study, the authors attempts to solve these two problems by parallelizing the EM algorithm on a multiprocessor system. The authors have implemented an extended hypercube (EH) architecture for the high-speed computation of the EM algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs). The authors discuss and compare the performance of the EM algorithm on a 386/387 machine, CD 4360 mainframe, and on the EH system. The results show that the computational speed performance of an EH using DSP chips as PEs executing the EM image reconstruction algorithm is about 130 times better than that of the CD 4360 mainframe. The EH topology is expandable with more number of PEs.
Computational architecture for image processing on a small unmanned ground vehicle
NASA Astrophysics Data System (ADS)
Ho, Sean; Nguyen, Hung
2010-08-01
Man-portable Unmanned Ground Vehicles (UGVs) have been fielded on the battlefield with limited computing power. This limitation constrains their use primarily to teleoperation control mode for clearing areas and bomb defusing. In order to extend their capability to include the reconnaissance and surveillance missions of dismounted soldiers, a separate processing payload is desired. This paper presents a processing architecture and the design details on the payload module that enables the PackBot to perform sophisticated, real-time image processing algorithms using data collected from its onboard imaging sensors including LADAR, IMU, visible, IR, stereo, and the Ladybug spherical cameras. The entire payload is constructed from currently available Commercial off-the-shelf (COTS) components including an Intel multi-core CPU and a Nvidia GPU. The result of this work enables a small UGV to perform computationally expensive image processing tasks that once were only feasible on a large workstation.
MediaMesh: an architecture for integrating isochronous processing algorithms into media servers
NASA Astrophysics Data System (ADS)
Amini, Lisa D.; Lepre, Jorge; Kienzle, Martin G.
1999-12-01
Current media servers do not provide the generality required to easily integrate arbitrary isochronous processing algorithms into streams of continuous media. Specifically, present day video server architectures primarily focus on disk and network strategies for efficiently managing available resources under stringent QoS guarantees. However, they do not fully consider the problems of integrating the wide variety of algorithms required for interactive multimedia applications. Examples of applications benefiting from a more flexible server environment include watermarking, encrypting or scrambling streams, visual VCR operations, and multiplexing or demultiplexing of live presentations. In this paper, we detail the MediaMesh architecture for integrating arbitrary isochronous processing algorithms into general purpose media servers. Our framework features a programming model through which user-written modules can be dynamically loaded and interconnected in self-managing graphs of stream processing components. Design highlights include novel techniques for distributed stream control, efficient buffer management and QoS management. To demonstrate its applicability, we have implemented the MediaMesh architecture in the context of a commercial video server. We illustrate the viability of the architecture through performance data collected from four processing modules that were implemented to facilitate new classes of applications on our video server.
An architecture for quantum computation with magnetically trapped Holmium atoms
NASA Astrophysics Data System (ADS)
Saffman, Mark; Hostetter, James; Booth, Donald; Collett, Jeffrey
2016-05-01
Outstanding challenges for scalable neutral atom quantum computation include correction of atom loss due to collisions with untrapped background gas, reduction of crosstalk during state preparation and measurement due to scattering of near resonant light, and the need to improve quantum gate fidelity. We present a scalable architecture based on loading single Holmium atoms into an array of Ioffe-Pritchard traps. The traps are formed by grids of superconducting wires giving a trap array with 40 μm period, suitable for entanglement via long range Rydberg gates. The states | F = 5 , M = 5 > and | F = 7 , M = 7 > provide a magic trapping condition at a low field of 3.5 G for long coherence time qubit encoding. The F = 11 level will be used for state preparation and measurement. The availability of different states for encoding, gate operations, and measurement, spectroscopically isolates the different operations and will prevent crosstalk to neighboring qubits. Operation in a cryogenic environment with ultra low pressure will increase atom lifetime and Rydberg gate fidelity by reduction of blackbody induced Rydberg decay. We will present a complete description of the architecture including estimates of achievable performance metrics. Work supported by NSF award PHY-1404357.
Computationally efficient algorithms for real-time attitude estimation
NASA Technical Reports Server (NTRS)
Pringle, Steven R.
1993-01-01
For many practical spacecraft applications, algorithms for determining spacecraft attitude must combine inputs from diverse sensors and provide redundancy in the event of sensor failure. A Kalman filter is suitable for this task, however, it may impose a computational burden which may be avoided by sub optimal methods. A suboptimal estimator is presented which was implemented successfully on the Delta Star spacecraft which performed a 9 month SDI flight experiment in 1989. This design sought to minimize algorithm complexity to accommodate the limitations of an 8K guidance computer. The algorithm used is interpreted in the framework of Kalman filtering and a derivation is given for the computation.
A novel bit-quad-based Euler number computing algorithm.
Yao, Bin; He, Lifeng; Kang, Shiying; Chao, Yuyan; Zhao, Xiao
2015-01-01
The Euler number of a binary image is an important topological property in computer vision and pattern recognition. This paper proposes a novel bit-quad-based Euler number computing algorithm. Based on graph theory and analysis on bit-quad patterns, our algorithm only needs to count two bit-quad patterns. Moreover, by use of the information obtained during processing the previous bit-quad, the average number of pixels to be checked for processing a bit-quad is only 1.75. Experimental results demonstrated that our method outperforms significantly conventional Euler number computing algorithms. PMID:26636023
Chakrabarti, C. . Dept. of Electrical Engineering); Ja Ja, J. . Dept. of Electrical Engineering)
1990-11-01
This paper proposes two-dimensional systolic array implementations for computing the discrete Hartley (DHT) and the discrete cosine transforms (DCT) when the transform size N is decomposable into mutually prime factors. The existing two-dimensional formulations for DHT and DCT are modified and the corresponding algorithms are mapped into two-dimensional systolic arrays. The resulting architecture is fully pipelined with no control units. The hardware design is based on bit serial left to right MSB to LSB binary arithmetic.
Examining the architecture of cellular computing through a comparative study with a computer.
Wang, Degeng; Gribskov, Michael
2005-06-22
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching
Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel
2012-12-28
DNA analysis is an emerging application of high performance bioinformatic. Modern sequencing machinery are able to provide, in few hours, large input streams of data, which needs to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and fastly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges on current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limit of the underlining hardware, is required to reach the desired high throughputs. In this paper, we discuss the implementation of the Aho-Corasick algorithm for GPU-accelerated high performance systems. We present an optimized implementation of Aho-Corasick for GPUs and discuss its tradeoffs on the Tesla T10 and he new Tesla T20 (codename Fermi) GPUs. We then integrate the optimized GPU code, respectively, in a MPI-based and in a pthreads-based load balancer to enable execution of the algorithm on clusters and large sharedmemory multiprocessors (SMPs) accelerated with multiple GPUs.
Genetic algorithms: What computers can learn from Darwin
Walbridge, C.T. )
1989-01-01
In this article the author posits a field of computing based on the genetic algorithm. This approach to programming mimics evolution by utilizing a computer to solve problems on a trial and error basis and ascertain the best answer through natural selection of the best of the computer's guesses. The author discusses the viability of this system in comparison to that of artificial intelligence.
The Use of Alternative Algorithms in Whole Number Computation
ERIC Educational Resources Information Center
Norton, Stephen
2012-01-01
Pedagogical reform in Australia over the last few decades has resulted in a reduced emphasis on the teaching of computational algorithms and a diversity of alternative mechanisms to teach students whole number computations. The effect of these changes upon student recording of whole number computations has had little empirical investigation. As…
AES-128 Bit Algorithm Using Fully Pipelined Architecture for Secret Communication
NASA Astrophysics Data System (ADS)
Gnanambika, M.; Adilakshmi, S.; Noorbasha, Fazal
2012-03-01
In this paper, an efficient method for high speed hardware implementation of AES algorithm is presented. So far, many implementations of AES have been proposed, for various goals that effect the Sub Byte transformation in various ways. These methods of implementation are based on combinational logic and are done in polynomial bases. In the proposed architecture, it is done by using composite field arithmetic in normal bases. In addition, efficient key expansion architecture suitable for 6 sub pipelined round units is also presented. These designs were described using VerilogHDL, simulated using Modelsim.
Efficient algorithm to compute the Berry conductivity
NASA Astrophysics Data System (ADS)
Dauphin, A.; Müller, M.; Martin-Delgado, M. A.
2014-07-01
We propose and construct a numerical algorithm to calculate the Berry conductivity in topological band insulators. The method is applicable to cold atom systems as well as solid state setups, both for the insulating case where the Fermi energy lies in the gap between two bulk bands as well as in the metallic regime. This method interpolates smoothly between both regimes. The algorithm is gauge-invariant by construction, efficient, and yields the Berry conductivity with known and controllable statistical error bars. We apply the algorithm to several paradigmatic models in the field of topological insulators, including Haldane's model on the honeycomb lattice, the multi-band Hofstadter model, and the BHZ model, which describes the 2D spin Hall effect observed in CdTe/HgTe/CdTe quantum well heterostructures.
Efficient Homotopy Continuation Algorithms with Application to Computational Fluid Dynamics
NASA Astrophysics Data System (ADS)
Brown, David A.
New homotopy continuation algorithms are developed and applied to a parallel implicit finite-difference Newton-Krylov-Schur external aerodynamic flow solver for the compressible Euler, Navier-Stokes, and Reynolds-averaged Navier-Stokes equations with the Spalart-Allmaras one-equation turbulence model. Many new analysis tools, calculations, and numerical algorithms are presented for the study and design of efficient and robust homotopy continuation algorithms applicable to solving very large and sparse nonlinear systems of equations. Several specific homotopies are presented and studied and a methodology is presented for assessing the suitability of specific homotopies for homotopy continuation. . A new class of homotopy continuation algorithms, referred to as monolithic homotopy continuation algorithms, is developed. These algorithms differ from classical predictor-corrector algorithms by combining the predictor and corrector stages into a single update, significantly reducing the amount of computation and avoiding wasted computational effort resulting from over-solving in the corrector phase. The new algorithms are also simpler from a user perspective, with fewer input parameters, which also improves the user's ability to choose effective parameters on the first flow solve attempt. Conditional convergence is proved analytically and studied numerically for the new algorithms. The performance of a fully-implicit monolithic homotopy continuation algorithm is evaluated for several inviscid, laminar, and turbulent flows over NACA 0012 airfoils and ONERA M6 wings. The monolithic algorithm is demonstrated to be more efficient than the predictor-corrector algorithm for all applications investigated. It is also demonstrated to be more efficient than the widely-used pseudo-transient continuation algorithm for all inviscid and laminar cases investigated, and good performance scaling with grid refinement is demonstrated for the inviscid cases. Performance is also demonstrated
A fast algorithm for sparse matrix computations related to inversion
NASA Astrophysics Data System (ADS)
Li, S.; Wu, W.; Darve, E.
2013-06-01
We have developed a fast algorithm for computing certain entries of the inverse of a sparse matrix. Such computations are critical to many applications, such as the calculation of non-equilibrium Green's functions Gr and G< for nano-devices. The FIND (Fast Inverse using Nested Dissection) algorithm is optimal in the big-O sense. However, in practice, FIND suffers from two problems due to the width-2 separators used by its partitioning scheme. One problem is the presence of a large constant factor in the computational cost of FIND. The other problem is that the partitioning scheme used by FIND is incompatible with most existing partitioning methods and libraries for nested dissection, which all use width-1 separators. Our new algorithm resolves these problems by thoroughly decomposing the computation process such that width-1 separators can be used, resulting in a significant speedup over FIND for realistic devices — up to twelve-fold in simulation. The new algorithm also has the added advantage that desired off-diagonal entries can be computed for free. Consequently, our algorithm is faster than the current state-of-the-art recursive methods for meshes of any size. Furthermore, the framework used in the analysis of our algorithm is the first attempt to explicitly apply the widely-used relationship between mesh nodes and matrix computations to the problem of multiple eliminations with reuse of intermediate results. This framework makes our algorithm easier to generalize, and also easier to compare against other methods related to elimination trees. Finally, our accuracy analysis shows that the algorithms that require back-substitution are subject to significant extra round-off errors, which become extremely large even for some well-conditioned matrices or matrices with only moderately large condition numbers. When compared to these back-substitution algorithms, our algorithm is generally a few orders of magnitude more accurate, and our produced round-off errors
Parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
NASA Astrophysics Data System (ADS)
Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
2016-05-01
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures
Eberlein, P.J. ); Park, H. )
1990-04-01
One-sided methods for implementing Jacobi diagonalization algorithms have been recently proposed for both distributed memory and vector machines. These methods are naturally well suited to distributed memory and vector architectures because of their inherent parallelism and their abundance of vector operations. Also, one-sided methods require substantially less message passing than the two-sided methods, and thus can achieve higher efficiency. The authors describe in detail the use of the one-sided Jacobi rotation as opposed to the rotation used in the Hestenes algorithm; they perceive the difference to have been widely misunderstood. Furthermore, the one-sided algorithm generalizes to other problems such as the nonsymmetric eigenvalue problem while the Hestenes algorithm does not. The authors discuss two new implementations for Jacobi sets for a ring connected array of processors and show their isomorphism to the round-robin ordering.
A computational fluid dynamics algorithm on a massively parallel computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The implementation and performance of a finite-difference algorithm for the compressible Navier-Stokes equations in two or three dimensions on the Connection Machine are described. This machine is a single-instruction multiple-data machine with up to 65536 physical processors. The implicit portion of the algorithm is of particular interest. Running times and megadrop rates are given for two- and three-dimensional problems. Included are comparisons with the standard codes on a Cray X-MP/48.
Earth Science Computational Architecture for Multi-disciplinary Investigations
NASA Astrophysics Data System (ADS)
Parker, J. W.; Blom, R.; Gurrola, E.; Katz, D.; Lyzenga, G.; Norton, C.
2005-12-01
Understanding the processes underlying Earth's deformation and mass transport requires a non-traditional, integrated, interdisciplinary, approach dependent on multiple space and ground based data sets, modeling, and computational tools. Currently, details of geophysical data acquisition, analysis, and modeling largely limit research to discipline domain experts. Interdisciplinary research requires a new computational architecture that is optimized to perform complex data processing of multiple solid Earth science data types in a user-friendly environment. A web-based computational framework is being developed and integrated with applications for automatic interferometric radar processing, and models for high-resolution deformation & gravity, forward models of viscoelastic mass loading over short wavelengths & complex time histories, forward-inverse codes for characterizing surface loading-response over time scales of days to tens of thousands of years, and inversion of combined space magnetic & gravity fields to constrain deep crustal and mantle properties. This framework combines an adaptation of the QuakeSim distributed services methodology with the Pyre framework for multiphysics development. The system uses a three-tier architecture, with a middle tier server that manages user projects, available resources, and security. This ensures scalability to very large networks of collaborators. Users log into a web page and have a personal project area, persistently maintained between connections, for each application. Upon selection of an application and host from a list of available entities, inputs may be uploaded or constructed from web forms and available data archives, including gravity, GPS and imaging radar data. The user is notified of job completion and directed to results posted via URLs. Interdisciplinary work is supported through easy availability of all applications via common browsers, application tutorials and reference guides, and worked examples with
E-Governance and Service Oriented Computing Architecture Model
NASA Astrophysics Data System (ADS)
Tejasvee, Sanjay; Sarangdevot, S. S.
2010-11-01
E-Governance is the effective application of information communication and technology (ICT) in the government processes to accomplish safe and reliable information lifecycle management. Lifecycle of the information involves various processes as capturing, preserving, manipulating and delivering information. E-Governance is meant to transform of governance in better manner to the citizens which is transparent, reliable, participatory, and accountable in point of view. The purpose of this paper is to attempt e-governance model, focus on the Service Oriented Computing Architecture (SOCA) that includes combination of information and services provided by the government, innovation, find out the way of optimal service delivery to citizens and implementation in transparent and liable practice. This paper also try to enhance focus on the E-government Service Manager as a essential or key factors service oriented and computing model that provides a dynamically extensible structural design in which all area or branch can bring in innovative services. The heart of this paper examine is an intangible model that enables E-government communication for trade and business, citizen and government and autonomous bodies.
NASA Technical Reports Server (NTRS)
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
Algorithm for Computing Particle/Surface Interactions
NASA Technical Reports Server (NTRS)
Hughes, David W.
2009-01-01
An algorithm has been devised for predicting the behaviors of sparsely spatially distributed particles impinging on a solid surface in a rarefied atmosphere. Under the stated conditions, prior particle-transport models in which (1) dense distributions of particles are treated as continuum fluids; or (2) sparse distributions of particles are considered to be suspended in and to diffuse through fluid streams are not valid.
NASA Technical Reports Server (NTRS)
Gentzsch, W.
1982-01-01
Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.
Visual pattern recognition network: its training algorithm and its optoelectronic architecture
NASA Astrophysics Data System (ADS)
Wang, Ning; Liu, Liren
1996-07-01
A visual pattern recognition network and its training algorithm are proposed. The network constructed of a one-layer morphology network and a two-layer modified Hamming net. This visual network can implement invariant pattern recognition with respect to image translation and size projection. After supervised learning takes place, the visual network extracts image features and classifies patterns much the same as living beings do. Moreover we set up its optoelectronic architecture for real-time pattern recognition.
A parallel sparse algorithm targeting arterial fluid mechanics computations
NASA Astrophysics Data System (ADS)
Manguoglu, Murat; Takizawa, Kenji; Sameh, Ahmed H.; Tezduyar, Tayfun E.
2011-09-01
Iterative solution of large sparse nonsymmetric linear equation systems is one of the numerical challenges in arterial fluid-structure interaction computations. This is because the fluid mechanics parts of the fluid + structure block of the equation system that needs to be solved at every nonlinear iteration of each time step corresponds to incompressible flow, the computational domains include slender parts, and accurate wall shear stress calculations require boundary layer mesh refinement near the arterial walls. We propose a hybrid parallel sparse algorithm, domain-decomposing parallel solver (DDPS), to address this challenge. As the test case, we use a fluid mechanics equation system generated by starting with an arterial shape and flow field coming from an FSI computation and performing two time steps of fluid mechanics computation with a prescribed arterial shape change, also coming from the FSI computation. We show how the DDPS algorithm performs in solving the equation system and demonstrate the scalability of the algorithm.
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention
NASA Astrophysics Data System (ADS)
Dasgupta, Udayan; Ali, Murtaza
2011-03-01
Low dose X-ray image sequences, as obtained in fluoroscopy, exhibit high levels of noise that must be suppressed in real-time, while preserving diagnostic structures. Multi-step adaptive filtering approaches, often involving spatio-temporal filters, are typically used to achieve this goal. In this work typical fluoroscopic image sequences, corrupted with Poisson noise, were processed using various filtering schemes. The noise suppression of the schemes was evaluated using objective image quality measures. Two adaptive spatio-temporal schemes, the first one using object detection and the second one using unsharp masking, were chosen as representative approaches for different fluoroscopy procedures and mapped on to Texas Instrument's (TI) high performance digital signal processors (DSP). The paper explains the fixed point design of these algorithms and evaluates its impact on overall system performance. The fixed point versions of these algorithms are mapped onto the C64x+TM core using instruction-level parallelism to effectively use its VLIW architecture. The overall data flow was carefully planned to reduce cache and data movement overhead, while working with large medical data sets. Apart from mapping these algorithms on to TI's single core DSP architecture, this work also distributes the operations to leverage multi-core DSP architectures. The data arrangement and flow were optimized to minimize inter-processor messaging and data movement overhead.
Algorithm implementation on the Navier-Stokes computer
NASA Technical Reports Server (NTRS)
Krist, Steven E.; Zang, Thomas A.
1987-01-01
The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Program partitioning and scheduling for NUMA computer architectures
Wolski, R.M.
1994-03-01
To effect the parallel execution of a program on a multiprocessor, each of the program`s constituent computations must be assigned to a processing resource within the multiprocessor. The problem of making this assignment so that execution time is minimized (known as the mapping problem) has been shown to be NP-complete. However, heuristics based on the performance characteristics of the target multiprocessor can yield execution times that approach the minimum possible. The mapping problem can be divided in to the problem of partitioning the computations into sequential threads, and the problem of scheduling those threads on the processors of the target system. This dissertation presents a logical framework and a set of heuristics that operate within the framework for the automatic partitioning and scheduling of programs at compile-time. The framework is based on the memory-node execution model which correctly captures the interaction between computations, processors, and the communication resources within a multiprocessor. The CP and HEF heuristics manipulate the features of the memory-node model to produce efficient program mappings. The effectiveness of the partitioning and scheduling techniques is investigated for Non-uniform Memory Access (NUMA) architecture types. To test the versatility of the approach, results are presented both for processors implementing strict execution semantics, and non-strict load/store semantics popular with RISC systems. The partitioner and scheduler are also used to investigate the possible advantages of multithreading (using either hardware or software), and the effectiveness of massively parallel systems, within a scientific programming context.
NASA Technical Reports Server (NTRS)
Ioup, G. E.
1985-01-01
The development of vector processing computers of the streaming array architectures has made possible a dramatic decrease in the time required for the solution of problems, i.e., for those algorithms which readily lend themselves to sequential operations on long vectors. There has concurrently been rapid growth in the applications of the techniques generally known as mathematical digital filtering, also called signal analysis, digital signal processing, time series analysis, and digital image processing. The best known applications of these techniques is to seismic data and two-dimensional images, data types consisting of very large collections of numbers. A major limitation for these cases is the size and speed of the computer available. Therefore, a move to the streaming array architecture can result in marked improvement in the data analysis techniques which may be employed.
Algorithms and software for solving finite element equations on serial and parallel architectures
NASA Technical Reports Server (NTRS)
George, Alan
1989-01-01
Over the past 15 years numerous new techniques have been developed for solving systems of equations and eigenvalue problems arising in finite element computations. A package called SPARSPAK has been developed by the author and his co-workers which exploits these new methods. The broad objective of this research project is to incorporate some of this software in the Computational Structural Mechanics (CSM) testbed, and to extend the techniques for use on multiprocessor architectures.
Parallel matrix transpose algorithms on distributed memory concurrent computers
Choi, Jaeyoung; Dongarra, J. |; Walker, D.W.
1994-12-31
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P {times} Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A {center_dot} B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T} {center_dot} B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
Durand, Melissa A; Wang, Steven; Hooley, Regina J; Raghu, Madhavi; Philpotts, Liane E
2016-01-01
As use of digital breast tomosynthesis becomes increasingly widespread, new management challenges are inevitable because tomosynthesis may reveal suspicious lesions not visible at conventional two-dimensional (2D) full-field digital mammography. Architectural distortion is a mammographic finding associated with a high positive predictive value for malignancy. It is detected more frequently at tomosynthesis than at 2D digital mammography and may even be occult at conventional 2D imaging. Few studies have focused on tomosynthesis-detected architectural distortions to date, and optimal management of these distortions has yet to be well defined. Since implementing tomosynthesis at our institution in 2011, we have learned some practical ways to assess architectural distortion. Because distortions may be subtle, tomosynthesis localization tools plus improved visualization of adjacent landmarks are crucial elements in guiding mammographic identification of elusive distortions. These same tools can guide more focused ultrasonography (US) of the breast, which facilitates detection and permits US-guided tissue sampling. Some distortions may be sonographically occult, in which case magnetic resonance imaging may be a reasonable option, both to increase diagnostic confidence and to provide a means for image-guided biopsy. As an alternative, tomosynthesis-guided biopsy, conventional stereotactic biopsy (when possible), or tomosynthesis-guided needle localization may be used to achieve tissue diagnosis. Practical uses for tomosynthesis in evaluation of architectural distortion are highlighted, potential complications are identified, and a working algorithm for management of tomosynthesis-detected architectural distortion is proposed. (©)RSNA, 2016. PMID:26963448
Neural computing architectures: The design of brain-like machines
Aleksander, I.
1989-01-01
Theoretical and applications aspects of neural-network (NN) computers are discussed in chapters contributed by European experts. Topics addressed include speech recognition based on topology-preserving neural maps, neural-map applications, backpropagation in nonfeedforward NNs, a parallel-distributed-processing learning approach to natural language, the learning capabilities of Boolean NNs, the logic of connectionist systems, and a probabilistic-logic NN for associative learning. Consideration is given to N-tuple sampling and genetic algorithms for speech recognition; the dynamic behavior of Boolean NNs; statistical mechanics and NNs; digital NNs, matched filters, and optical implementations; heteroassociative NNs using cabling vs link-disabling local modification rules; and the generation of movement trajectories in primates and robots. Also provided is an overview of parallel distributed processing.
Data Compression Algorithm Architecture for Large Depth-of-Field Particle Image Velocimeters
NASA Technical Reports Server (NTRS)
Bos, Brent; Memarsadeghi, Nargess; Kizhner, Semion; Antonille, Scott
2013-01-01
A large depth-of-field particle image velocimeter (PIV) is designed to characterize dynamic dust environments on planetary surfaces. This instrument detects lofted dust particles, and senses the number of particles per unit volume, measuring their sizes, velocities (both speed and direction), and shape factors when the particles are large. To measure these particle characteristics in-flight, the instrument gathers two-dimensional image data at a high frame rate, typically >4,000 Hz, generating large amounts of data for every second of operation, approximately 6 GB/s. To characterize a planetary dust environment that is dynamic, the instrument would have to operate for at least several minutes during an observation period, easily producing more than a terabyte of data per observation. Given current technology, this amount of data would be very difficult to store onboard a spacecraft, and downlink to Earth. Since 2007, innovators have been developing an autonomous image analysis algorithm architecture for the PIV instrument to greatly reduce the amount of data that it has to store and downlink. The algorithm analyzes PIV images and automatically reduces the image information down to only the particle measurement data that is of interest, reducing the amount of data that is handled by more than 10(exp 3). The state of development for this innovation is now fairly mature, with a functional algorithm architecture, along with several key pieces of algorithm logic, that has been proven through field test data acquired with a proof-of-concept PIV instrument.
Computational Algorithms for Device-Circuit Coupling
KEITER, ERIC R.; HUTCHINSON, SCOTT A.; HOEKSTRA, ROBERT J.; RANKIN, ERIC LAMONT; RUSSO, THOMAS V.; WATERS, LON J.
2003-01-01
Circuit simulation tools (e.g., SPICE) have become invaluable in the development and design of electronic circuits. Similarly, device-scale simulation tools (e.g., DaVinci) are commonly used in the design of individual semiconductor components. Some problems, such as single-event upset (SEU), require the fidelity of a mesh-based device simulator but are only meaningful when dynamically coupled with an external circuit. For such problems a mixed-level simulator is desirable, but the two types of simulation generally have different (sometimes conflicting) numerical requirements. To address these considerations, we have investigated variations of the two-level Newton algorithm, which preserves tight coupling between the circuit and the partial differential equations (PDE) device, while optimizing the numerics for both.
On the performances of computer vision algorithms on mobile platforms
NASA Astrophysics Data System (ADS)
Battiato, S.; Farinella, G. M.; Messina, E.; Puglisi, G.; Ravì, D.; Capra, A.; Tomaselli, V.
2012-01-01
Computer Vision enables mobile devices to extract the meaning of the observed scene from the information acquired with the onboard sensor cameras. Nowadays, there is a growing interest in Computer Vision algorithms able to work on mobile platform (e.g., phone camera, point-and-shot-camera, etc.). Indeed, bringing Computer Vision capabilities on mobile devices open new opportunities in different application contexts. The implementation of vision algorithms on mobile devices is still a challenging task since these devices have poor image sensors and optics as well as limited processing power. In this paper we have considered different algorithms covering classic Computer Vision tasks: keypoint extraction, face detection, image segmentation. Several tests have been done to compare the performances of the involved mobile platforms: Nokia N900, LG Optimus One, Samsung Galaxy SII.
NASA Astrophysics Data System (ADS)
Wu, Huayi; Guan, Xuefeng; Gong, Jianya
2011-09-01
This paper presents a robust parallel Delaunay triangulation algorithm called ParaStream for processing billions of points from nonoverlapped block LiDAR files. The algorithm targets ubiquitous multicore architectures. ParaStream integrates streaming computation with a traditional divide-and-conquer scheme, in which additional erase steps are implemented to reduce the runtime memory footprint. Furthermore, a kd-tree-based dynamic schedule strategy is also proposed to distribute triangulation and merging work onto the processor cores for improved load balance. ParaStream exploits most of the computing power of multicore platforms through parallel computing, demonstrating qualities of high data throughput as well as a low memory footprint. Experiments on a 2-Way-Quad-Core Intel Xeon platform show that ParaStream can triangulate approximately one billion LiDAR points (16.4 GB) in about 16 min with only 600 MB physical memory. The total speedup (including I/O time) is about 6.62 with 8 concurrent threads.
Zhang, Hao; Zhao, Yan; Cao, Liangcai; Jin, Guofan
2015-02-23
We propose an algorithm based on fully computed holographic stereogram for calculating full-parallax computer-generated holograms (CGHs) with accurate depth cues. The proposed method integrates point source algorithm and holographic stereogram based algorithm to reconstruct the three-dimensional (3D) scenes. Precise accommodation cue and occlusion effect can be created, and computer graphics rendering techniques can be employed in the CGH generation to enhance the image fidelity. Optical experiments have been performed using a spatial light modulator (SLM) and a fabricated high-resolution hologram, the results show that our proposed algorithm can perform quality reconstructions of 3D scenes with arbitrary depth information. PMID:25836429
Applying a cloud computing approach to storage architectures for spacecraft
NASA Astrophysics Data System (ADS)
Baldor, Sue A.; Quiroz, Carlos; Wood, Paul
As sensor technologies, processor speeds, and memory densities increase, spacecraft command, control, processing, and data storage systems have grown in complexity to take advantage of these improvements and expand the possible missions of spacecraft. Spacecraft systems engineers are increasingly looking for novel ways to address this growth in complexity and mitigate associated risks. Looking to conventional computing, many solutions have been executed to solve both the problem of complexity and heterogeneity in systems. In particular, the cloud-based paradigm provides a solution for distributing applications and storage capabilities across multiple platforms. In this paper, we propose utilizing a cloud-like architecture to provide a scalable mechanism for providing mass storage in spacecraft networks that can be reused on multiple spacecraft systems. By presenting a consistent interface to applications and devices that request data to be stored, complex systems designed by multiple organizations may be more readily integrated. Behind the abstraction, the cloud storage capability would manage wear-leveling, power consumption, and other attributes related to the physical memory devices, critical components in any mass storage solution for spacecraft. Our approach employs SpaceWire networks and SpaceWire-capable devices, although the concept could easily be extended to non-heterogeneous networks consisting of multiple spacecraft and potentially the ground segment.
Hybrid VLSI/QCA Architecture for Computing FFTs
NASA Technical Reports Server (NTRS)
Fijany, Amir; Toomarian, Nikzad; Modarres, Katayoon; Spotnitz, Matthew
2003-01-01
A data-processor architecture that would incorporate elements of both conventional very-large-scale integrated (VLSI) circuitry and quantum-dot cellular automata (QCA) has been proposed to enable the highly parallel and systolic computation of fast Fourier transforms (FFTs). The proposed circuit would complement the QCA-based circuits described in several prior NASA Tech Briefs articles, namely Implementing Permutation Matrices by Use of Quantum Dots (NPO-20801), Vol. 25, No. 10 (October 2001), page 42; Compact Interconnection Networks Based on Quantum Dots (NPO-20855) Vol. 27, No. 1 (January 2003), page 32; and Bit-Serial Adder Based on Quantum Dots (NPO-20869), Vol. 27, No. 1 (January 2003), page 35. The cited prior articles described the limitations of very-large-scale integrated (VLSI) circuitry and the major potential advantage afforded by QCA. To recapitulate: In a VLSI circuit, signal paths that are required not to interact with each other must not cross in the same plane. In contrast, for reasons too complex to describe in the limited space available for this article, suitably designed and operated QCAbased signal paths that are required not to interact with each other can nevertheless be allowed to cross each other in the same plane without adverse effect. In principle, this characteristic could be exploited to design compact, coplanar, simple (relative to VLSI) QCA-based networks to implement complex, advanced interconnection schemes.
PHENIX On-Line Distributed Computing System Architecture
Desmond, Edmond; Haggerty, John; Kehayias, Hyon Joo; Purschke, Martin L.; Witzig, Chris; Kozlowski, Thomas
1997-05-22
PHENIX is one of the two large experiments at the Relativistic Heavy Ion Collider (RHIC) currently under construction at Brookhaven National Laboratory. The detector consists of 11 sub-detectors, that are further subdivided into 29 units (``granules``) that can be operated independently, which includes simultaneous data taking with independent data streams and independent triggers. The detector has 250,000 channels and is read out by front end modules, where the data is buffered in a pipeline while awaiting the level trigger decision. Zero suppression and calibration is done after the level accept in custom built data collection modules (DCMs) with DSPs before the data is sent to an event builder (design throughput of 2 Gb/sec) and higher level triggers. The On-line Computing Systems Group (ONCS) has two responsibilities. Firstly it is responsible for receiving the data from the event builder, routing it through a network of workstations to consumer processes and archiving it at a data rate of 20 MB/sec. Secondly it is also responsible for the overall configuration, control and operation of the detector and data acquisition chain, which comprises the software integration for several thousand custom built hardware modules. The software must furthermore support the independent operation of the above mentioned granules, which includes the coordination of processes that run in 60-100 VME processors and workstations. ONOS has adapted the Shlaer- Mellor Object Oriented Methodology for the design of the top layer software. CORBA is used as communication layer between the distributed objects, which are implemented as asynchronous finite state machines. We will give an overview of the PHENIX online system with the main focus on the system architecture, software components and integration tasks of the On-line Computing group ONCS and report on the status of the current prototypes.
A one-sided Jacobi algorithm for computing the singular value decomposition on a vector computer
De Rijk, P.P.M. )
1989-03-01
An old algorithm for computing the singular value decomposition, which was first mentioned by Hestenes has gained renewed interest because of its properties of parallelism and vectorizability. Some computational modifications are given and a comparison with the well-known Golub-Reinsch algorithm is made. In this paper comparative experiments on CYBER 205 are reported.
Parallel algorithm for computing 3-D reachable workspaces
NASA Astrophysics Data System (ADS)
Alameldin, Tarek K.; Sobh, Tarek M.
1992-03-01
The problem of computing the 3-D workspace for redundant articulated chains has applications in a variety of fields such as robotics, computer aided design, and computer graphics. The computational complexity of the workspace problem is at least NP-hard. The recent advent of parallel computers has made practical solutions for the workspace problem possible. Parallel algorithms for computing the 3-D workspace for redundant articulated chains with joint limits are presented. The first phase of these algorithms computes workspace points in parallel. The second phase uses workspace points that are computed in the first phase and fits a 3-D surface around the volume that encompasses the workspace points. The second phase also maps the 3- D points into slices, uses region filling to detect the holes and voids in the workspace, extracts the workspace boundary points by testing the neighboring cells, and tiles the consecutive contours with triangles. The proposed algorithms are efficient for computing the 3-D reachable workspace for articulated linkages, not only those with redundant degrees of freedom but also those with joint limits.
A computer algorithm for automatic beam steering
Drennan, E.
1992-06-01
Beam steering is done by modifying the current in a trim or bending magnet. If the current change is the right amount the beam can be made to bend in such a manner that it will hit a swic or BPM downstream from the magnet at a predetermined set point. Although both bending magnets and trim magnets can be used to modify beam angle, beam steering is usually done with trim magnets. This is so because, during beam steering the beam angle is usually modified only by a small amount which can be easily achieved with a trim magnet. Thus in this note, all steering magnets will be assumed to be trim magnets. There are two ways of monitoring beam position. One way is done using a BPM and the other is done using a swic. For simplicity, beam position monitoring in this paper will be referred to being done with a swic. Beam steering can be done manually by changing the current through a trim magnet and monitoring the position of the beam downstream from the magnet with a swic. Alternatively the beam can be positioned automatically using a computer which periodically updates the current through a specific number of trim magnets. The purpose of this note is to describe the steps involved in coming up with such a computer program. There are two main aspects to automatic beam steering. First a relationship between the beam position and the bending magnet is needed. Secondly a beamline setup of swics and trim magnets has to be chosen that will position the beam according to the desired specifications. A simple example will be looked at that will show that once a mathematical relationship between the needed change of the beam position on a swic and the change in trim currents is established, a computer could be programmed to calculate and update the trim currents.
Gradient Learning Algorithms for Ontology Computing
Gao, Wei; Zhu, Linli
2014-01-01
The gradient learning model has been raising great attention in view of its promising perspectives for applications in statistics, data dimensionality reducing, and other specific fields. In this paper, we raise a new gradient learning model for ontology similarity measuring and ontology mapping in multidividing setting. The sample error in this setting is given by virtue of the hypothesis space and the trick of ontology dividing operator. Finally, two experiments presented on plant and humanoid robotics field verify the efficiency of the new computation model for ontology similarity measure and ontology mapping applications in multidividing setting. PMID:25530752
Data bank homology search algorithm with linear computation complexity.
Strelets, V B; Ptitsyn, A A; Milanesi, L; Lim, H A
1994-06-01
A new algorithm for data bank homology search is proposed. The principal advantages of the new algorithm are: (i) linear computation complexity; (ii) low memory requirements; and (iii) high sensitivity to the presence of local region homology. The algorithm first calculates indicative matrices of k-tuple 'realization' in the query sequence and then searches for an appropriate number of matching k-tuples within a narrow range in database sequences. It does not require k-tuple coordinates tabulation and in-memory placement for database sequences. The algorithm is implemented in a program for execution on PC-compatible computers and tested on PIR and GenBank databases with good results. A few modifications designed to improve the selectivity are also discussed. As an application example, the search for homology of the mouse homeotic protein HOX 3.1 is given. PMID:7922689
Application of a new finite difference algorithm for computational aeroacoustics
NASA Technical Reports Server (NTRS)
Goodrich, John W.
1995-01-01
Acoustic problems have become extremely important in recent years because of research efforts such as the High Speed Civil Transport program. Computational aeroacoustics (CAA) requires a faithful representation of wave propagation over long distances, and needs algorithms that are accurate and boundary conditions that are unobtrusive. This paper applies a new finite difference method and boundary algorithm to the Linearized Euler Equations (LEE). The results demonstrate the ability of a new fourth order propagation algorithm to accurately simulate the genuinely multidimensional wave dynamics of acoustic propagation in two space dimensions with the LEE. The results also show the ability of a new outflow boundary condition and fourth order algorithm to pass the evolving solution from the computational domain with no perceptible degradation of the solution remaining within the domain.
Radiation transport algorithms on trans-petaflops supercomputers of different architectures.
Christopher, Thomas Woods
2003-08-01
We seek to understand which supercomputer architecture will be best for supercomputers at the Petaflops scale and beyond. The process we use is to predict the cost and performance of several leading architectures at various years in the future. The basis for predicting the future is an expanded version of Moore's Law called the International Technology Roadmap for Semiconductors (ITRS). We abstract leading supercomputer architectures into chips connected by wires, where the chips and wires have electrical parameters predicted by the ITRS. We then compute the cost of a supercomputer system and the run time on a key problem of interest to the DOE (radiation transport). These calculations are parameterized by the time into the future and the technology expected to be available at that point. We find the new advanced architectures have substantial performance advantages but conventional designs are likely to be less expensive (due to economies of scale). We do not find a universal ''winner'', but instead the right architectural choice is likely to involve non-technical factors such as the availability of capital and how long people are willing to wait for results.
An Agent Inspired Reconfigurable Computing Implementation of a Genetic Algorithm
NASA Technical Reports Server (NTRS)
Weir, John M.; Wells, B. Earl
2003-01-01
Many software systems have been successfully implemented using an agent paradigm which employs a number of independent entities that communicate with one another to achieve a common goal. The distributed nature of such a paradigm makes it an excellent candidate for use in high speed reconfigurable computing hardware environments such as those present in modem FPGA's. In this paper, a distributed genetic algorithm that can be applied to the agent based reconfigurable hardware model is introduced. The effectiveness of this new algorithm is evaluated by comparing the quality of the solutions found by the new algorithm with those found by traditional genetic algorithms. The performance of a reconfigurable hardware implementation of the new algorithm on an FPGA is compared to traditional single processor implementations.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
An algorithm for computationally expensive engineering optimization problems
NASA Astrophysics Data System (ADS)
Yoel, Tenne
2013-07-01
Modern engineering design often relies on computer simulations to evaluate candidate designs, a scenario which results in an optimization of a computationally expensive black-box function. In these settings, there will often exist candidate designs which cause the simulation to fail, and can therefore degrade the search effectiveness. To address this issue, this paper proposes a new metamodel-assisted computational intelligence optimization algorithm which incorporates classifiers into the optimization search. The classifiers predict which candidate designs are expected to cause the simulation to fail, and this prediction is used to bias the search towards designs predicted to be valid. To enhance the search effectiveness, the proposed algorithm uses an ensemble approach which concurrently employs several metamodels and classifiers. A rigorous performance analysis based on a set of simulation-driven design optimization problems shows the effectiveness of the proposed algorithm.
A computational study of routing algorithms for realistic transportation networks
Jacob, R.; Marathe, M.V.; Nagel, K.
1998-12-01
The authors carry out an experimental analysis of a number of shortest path (routing) algorithms investigated in the context of the TRANSIMS (Transportation Analysis and Simulation System) project. The main focus of the paper is to study how various heuristic and exact solutions, associated data structures affected the computational performance of the software developed especially for realistic transportation networks. For this purpose the authors have used Dallas Fort-Worth road network with very high degree of resolution. The following general results are obtained: (1) they discuss and experimentally analyze various one-one shortest path algorithms, which include classical exact algorithms studied in the literature as well as heuristic solutions that are designed to take into account the geometric structure of the input instances; (2) they describe a number of extensions to the basic shortest path algorithm. These extensions were primarily motivated by practical problems arising in TRANSIMS and ITS (Intelligent Transportation Systems) related technologies. Extensions discussed include--(i) time dependent networks, (ii) multi-modal networks, (iii) networks with public transportation and associated schedules. Computational results are provided to empirically compare the efficiency of various algorithms. The studies indicate that a modified Dijkstra`s algorithm is computationally fast and an excellent candidate for use in various transportation planning applications as well as ITS related technologies.
A fast algorithm for sparse matrix computations related to inversion
Li, S.; Wu, W.; Darve, E.
2013-06-01
We have developed a fast algorithm for computing certain entries of the inverse of a sparse matrix. Such computations are critical to many applications, such as the calculation of non-equilibrium Green’s functions G{sup r} and G{sup <} for nano-devices. The FIND (Fast Inverse using Nested Dissection) algorithm is optimal in the big-O sense. However, in practice, FIND suffers from two problems due to the width-2 separators used by its partitioning scheme. One problem is the presence of a large constant factor in the computational cost of FIND. The other problem is that the partitioning scheme used by FIND is incompatible with most existing partitioning methods and libraries for nested dissection, which all use width-1 separators. Our new algorithm resolves these problems by thoroughly decomposing the computation process such that width-1 separators can be used, resulting in a significant speedup over FIND for realistic devices — up to twelve-fold in simulation. The new algorithm also has the added advantage that desired off-diagonal entries can be computed for free. Consequently, our algorithm is faster than the current state-of-the-art recursive methods for meshes of any size. Furthermore, the framework used in the analysis of our algorithm is the first attempt to explicitly apply the widely-used relationship between mesh nodes and matrix computations to the problem of multiple eliminations with reuse of intermediate results. This framework makes our algorithm easier to generalize, and also easier to compare against other methods related to elimination trees. Finally, our accuracy analysis shows that the algorithms that require back-substitution are subject to significant extra round-off errors, which become extremely large even for some well-conditioned matrices or matrices with only moderately large condition numbers. When compared to these back-substitution algorithms, our algorithm is generally a few orders of magnitude more accurate, and our produced round
A scalable parallel graph coloring algorithm for distributed memory computers.
Bozdag, Doruk; Manne, Fredrik; Gebremedhin, Assefaw H.; Catalyurek, Umit; Boman, Erik Gunnar
2005-02-01
In large-scale parallel applications a graph coloring is often carried out to schedule computational tasks. In this paper, we describe a new distributed memory algorithm for doing the coloring itself in parallel. The algorithm operates in an iterative fashion; in each round vertices are speculatively colored based on limited information, and then a set of incorrectly colored vertices, to be recolored in the next round, is identified. Parallel speedup is achieved in part by reducing the frequency of communication among processors. Experimental results on a PC cluster using up to 16 processors show that the algorithm is scalable.
Parallel matrix transpose algorithms on distributed memory concurrent computers
Choi, J.; Walker, D.W.; Dongarra, J.J. |
1993-10-01
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A{center_dot}B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T}{center_dot}B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
General purpose architecture for intelligent computer-aided training
NASA Technical Reports Server (NTRS)
Loftin, R. Bowen (Inventor); Wang, Lui (Inventor); Baffes, Paul T. (Inventor); Hua, Grace C. (Inventor)
1994-01-01
An intelligent computer-aided training system having a general modular architecture is provided for use in a wide variety of training tasks and environments. It is comprised of a user interface which permits the trainee to access the same information available in the task environment and serves as a means for the trainee to assert actions to the system; a domain expert which is sufficiently intelligent to use the same information available to the trainee and carry out the task assigned to the trainee; a training session manager for examining the assertions made by the domain expert and by the trainee for evaluating such trainee assertions and providing guidance to the trainee which are appropriate to his acquired skill level; a trainee model which contains a history of the trainee interactions with the system together with summary evaluative data; an intelligent training scenario generator for designing increasingly complex training exercises based on the current skill level contained in the trainee model and on any weaknesses or deficiencies that the trainee has exhibited in previous interactions; and a blackboard that provides a common fact base for communication between the other components of the system. Preferably, the domain expert contains a list of 'mal-rules' which typifies errors that are usually made by novice trainees. Also preferably, the training session manager comprises an intelligent error detection means and an intelligent error handling means. The present invention utilizes a rule-based language having a control structure whereby a specific message passing protocol is utilized with respect to tasks which are procedural or step-by-step in structure. The rules can be activated by the trainee in any order to reach the solution by any valid or correct path.
Beard, R.A.
1990-03-01
The purpose of this thesis is to explore the methods used to parallelize NP-complete problems and the degree of improvement that can be realized using a distributed parallel processor to solve these combinatoric problems. Common NP-complete problem characteristics such as a priori reductions, use of partial-state information, and inhomogeneous searches are identified and studied. The set covering problem (SCP) is implemented for this research because many applications such as information retrieval, task scheduling, and VLSI expression simplification can be structured as an SCP problem. In addition, its generic NP-complete common characteristics are well documented and a parallel implementation has not been reported. Parallel programming design techniques involve decomposing the problem and developing the parallel algorithms. The major components of a parallel solution are developed in a four phase process. First, a meta-level design is accomplished using an appropriate design language such as UNITY. Then, the UNITY design is transformed into an algorithm and implementation specific to a distributed architecture. Finally, a complexity analysis of the algorithm is performed. the a priori reductions are divided-and-conquer algorithms; whereas, the search for the optimal set cover is accomplished with a branch-and-bound algorithm. The search utilizes a global best cost maintained at a central location for distribution to all processors. Three methods of load balancing are implemented and studied: coarse grain with static allocation of the search space, fine grain with dynamic allocation, and dynamic load balancing.
Efficient implementation of Jacobi algorithms and Jacobi sets on distributed memory architectures
NASA Astrophysics Data System (ADS)
Eberlein, P. J.; Park, Haesun
1990-04-01
One-sided methods for implementing Jacobi diagonalization algorithms have been recently proposed for both distributed memory and vector machines. These methods are naturally well suited to distributed memory and vector architectures because of their inherent parallelism and their abundance of vector operations. Also, one-sided methods require substantially less message passing than the two-sided methods, and thus can achieve higher efficiency. We describe in detail the use of the one-sided Jacobi rotation as opposed to the rotation used in the ``Hestenes'' algorithm; we perceive this difference to have been widely misunderstood. Furthermore the one-sided algorithm generalizes to other problems such as the nonsymmetric eigenvalue problem while the Hestenes algorithm does not. We discuss two new implementations for Jacobi sets for a ring connected array of processors and show their isomorphism to the round-robin ordering. Moreover, we show that two implementations produce Jacobi sets in identical orders up to a relabeling. These orderings are optimal in the sense that they complete each sweep in a minimum number of stages with minimal communication. We present implementation results of one-sided Jacobi algorithms using these orderings on the NCUBE/seven hypercube as well as the Intel iPSC/2 hypercube. Finally, we mention how other orderings, and can be, implemented. The number of nonisomorphic Jacobi sets has recently been shown to become infinite with increasing n. The work of this author was supported by National Science Foundation Grant CCR-8813493.
NASA Technical Reports Server (NTRS)
Eberhardt, D. S.; Baganoff, D.; Stevens, K.
1984-01-01
Implicit approximate-factored algorithms have certain properties that are suitable for parallel processing. A particular computational fluid dynamics (CFD) code, using this algorithm, is mapped onto a multiple-instruction/multiple-data-stream (MIMD) computer architecture. An explanation of this mapping procedure is presented, as well as some of the difficulties encountered when trying to run the code concurrently. Timing results are given for runs on the Ames Research Center's MIMD test facility which consists of two VAX 11/780's with a common MA780 multi-ported memory. Speedups exceeding 1.9 for characteristic CFD runs were indicated by the timing results.
Algorithmic techniques for computer vision on a fine-grained parallel machine
Little, J.J.; Blelloch, G.E.; Cass, T.A.
1989-03-01
The authors describe algorithms for several problems from computer vision, and illustrate how they are implemented using a set of primitive parallel operations. The primitives the authors use include general permutations, grid permutations, and the scan operation - a restricted form of the prefix computation. They cover well-known problems allowing us to concentrate on the implementations rather than the problems. First, they describe some simple routines such as border following, computing histograms and filtering. They then discuss several modules built on these routines including edge detection, Hough transforms, and connected component labeling. Finally, they describe how these modules are composed into higher level vision modules. By defining the routines using a set of primitives operations, they abstract away from a particular architecture. In particular, one does not have to worry about features of machines such as the number of processors or whether a tightly connected architecture has a hypercube network or a four-dimensional grid network. One still needs to worry about the relative performance of the primitives on particular machines. The authors discuss the tradeoffs among primitives and try to identify which primitives are most important for particular problems. All the primitives discussed are supported by the Connection Machine (CM), and they outline how they are implemented. They have implemented most of the algorithms described on the Connection Machine.
JPL control/structure interaction test bed real-time control computer architecture
NASA Technical Reports Server (NTRS)
Briggs, Hugh C.
1989-01-01
The Control/Structure Interaction Program is a technology development program for spacecraft that exhibit interactions between the control system and structural dynamics. The program objectives include development and verification of new design concepts - such as active structure - and new tools - such as combined structure and control optimization algorithm - and their verification in ground and possibly flight test. A focus mission spacecraft was designed based upon a space interferometer and is the basis for design of the ground test article. The ground test bed objectives include verification of the spacecraft design concepts, the active structure elements and certain design tools such as the new combined structures and controls optimization tool. In anticipation of CSI technology flight experiments, the test bed control electronics must emulate the computation capacity and control architectures of space qualifiable systems as well as the command and control networks that will be used to connect investigators with the flight experiment hardware. The Test Bed facility electronics were functionally partitioned into three units: a laboratory data acquisition system for structural parameter identification and performance verification; an experiment supervisory computer to oversee the experiment, monitor the environmental parameters and perform data logging; and a multilevel real-time control computing system. The design of the Test Bed electronics is presented along with hardware and software component descriptions. The system should break new ground in experimental control electronics and is of interest to anyone working in the verification of control concepts for large structures.
Adaptive kinetic-fluid solvers for heterogeneous computing architectures
NASA Astrophysics Data System (ADS)
Zabelok, Sergey; Arslanbekov, Robert; Kolobov, Vladimir
2015-12-01
We show feasibility and benefits of porting an adaptive multi-scale kinetic-fluid code to CPU-GPU systems. Challenges are due to the irregular data access for adaptive Cartesian mesh, vast difference of computational cost between kinetic and fluid cells, and desire to evenly load all CPUs and GPUs during grid adaptation and algorithm refinement. Our Unified Flow Solver (UFS) combines Adaptive Mesh Refinement (AMR) with automatic cell-by-cell selection of kinetic or fluid solvers based on continuum breakdown criteria. Using GPUs enables hybrid simulations of mixed rarefied-continuum flows with a million of Boltzmann cells each having a 24 × 24 × 24 velocity mesh. We describe the implementation of CUDA kernels for three modules in UFS: the direct Boltzmann solver using the discrete velocity method (DVM), the Direct Simulation Monte Carlo (DSMC) solver, and a mesoscopic solver based on the Lattice Boltzmann Method (LBM), all using adaptive Cartesian mesh. Double digit speedups on single GPU and good scaling for multi-GPUs have been demonstrated.
NASA Technical Reports Server (NTRS)
1985-01-01
Slides are reproduced that describe the importance of having high performance number crunching and graphics capability. They also indicate the types of research and development underway at Ames Research Center to ensure that, in the near term, Ames is a smart buyer and user, and in the long-term that Ames knows the best possible solutions for number crunching and graphics needs. The drivers for this research are real computational physics applications of interest to Ames and NASA. They are concerned with how to map the applications, and how to maximize the physics learned from the results of the calculations. The computer graphics activities are aimed at getting maximum information from the three-dimensional calculations by using the real time manipulation of three-dimensional data on the Silicon Graphics workstation. Work is underway on new algorithms that will permit the display of experimental results that are sparse and random, the same way that the dense and regular computed results are displayed.
Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2002-01-01
Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.
Computational Fluid Dynamics. [numerical methods and algorithm development
NASA Technical Reports Server (NTRS)
1992-01-01
This collection of papers was presented at the Computational Fluid Dynamics (CFD) Conference held at Ames Research Center in California on March 12 through 14, 1991. It is an overview of CFD activities at NASA Lewis Research Center. The main thrust of computational work at Lewis is aimed at propulsion systems. Specific issues related to propulsion CFD and associated modeling will also be presented. Examples of results obtained with the most recent algorithm development will also be presented.
Integrating Computing Resources: A Shared Distributed Architecture for Academics and Administrators.
ERIC Educational Resources Information Center
Beltrametti, Monica; English, Will
1994-01-01
Development and implementation of a shared distributed computing architecture at the University of Alberta (Canada) are described. Aspects discussed include design of the architecture, users' views of the electronic environment, technical and managerial challenges, and the campuswide human infrastructures needed to manage such an integrated…
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.
Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein
2015-12-01
Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include Smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraint and failure sensitivity. This is due to the fact that without safety considerations, small-integrated hardware will endanger patients' lives. Therefore, proposing some principals is required to construct wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed. PMID:26364202
Optical pattern recognition architecture implementing the mean-square error correlation algorithm
Molley, Perry A.
1991-01-01
An optical architecture implementing the mean-square error correlation algorithm, MSE=.SIGMA.[I-R].sup.2 for discriminating the presence of a reference image R in an input image scene I by computing the mean-square-error between a time-varying reference image signal s.sub.1 (t) and a time-varying input image signal s.sub.2 (t) includes a laser diode light source which is temporally modulated by a double-sideband suppressed-carrier source modulation signal I.sub.1 (t) having the form I.sub.1 (t)=A.sub.1 [1+.sqroot.2m.sub.1 s.sub.1 (t)cos (2.pi.f.sub.o t)] and the modulated light output from the laser diode source is diffracted by an acousto-optic deflector. The resultant intensity of the +1 diffracted order from the acousto-optic device is given by: I.sub.2 (t)=A.sub.2 [+2m.sub.2.sup.2 s.sub.2.sup.2 (t)-2.sqroot.2m.sub.2 (t) cos (2.pi.f.sub.o t] The time integration of the two signals I.sub.1 (t) and I.sub.2 (t) on the CCD deflector plane produces the result R(.tau.) of the mean-square error having the form: R(.tau.)=A.sub.1 A.sub.2 {[T]+[2m.sub.2.sup.2.multidot..intg.s.sub.2.sup.2 (t-.tau.)dt]-[2m.sub.1 m.sub.2 cos (2.tau.f.sub.o .tau.).multidot..intg.s.sub.1 (t)s.sub.2 (t-.tau.)dt]} where: s.sub.1 (t) is the signal input to the diode modulation source: s.sub.2 (t) is the signal input to the AOD modulation source; A.sub.1 is the light intensity; A.sub.2 is the diffraction efficiency; m.sub.1 and m.sub.2 are constants that determine the signal-to-bias ratio; f.sub.o is the frequency offset between the oscillator at f.sub.c and the modulation at f.sub.c +f.sub.o ; and a.sub.o and a.sub.1 are constant chosen to bias the diode source and the acousto-optic deflector into their respective linear operating regions so that the diode source exhibits a linear intensity characteristic and the AOD exhibits a linear amplitude characteristic.
Parallel of low-level computer vision algorithms on a multi-DSP system
NASA Astrophysics Data System (ADS)
Liu, Huaida; Jia, Pingui; Li, Lijian; Yang, Yiping
2011-06-01
Parallel hardware becomes a commonly used approach to satisfy the intensive computation demands of computer vision systems. A multiprocessor architecture based on hypercube interconnecting digital signal processors (DSPs) is described to exploit the temporal and spatial parallelism. This paper presents a parallel implementation of low level vision algorithms designed on multi-DSP system. The convolution operation has been parallelized by using redundant boundary partitioning. Performance of the parallel convolution operation is investigated by varying the image size, mask size and the number of processors. Experimental results show that the speedup is close to the ideal value. However, it can be found that the loading imbalance of processor can significantly affect the computation time and speedup of the multi- DSP system.
NML computation algorithms for tree-structured multinomial Bayesian networks.
Kontkanen, Petri; Wettig, Hannes; Myllymäki, Petri
2007-01-01
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks. PMID:18382603
Adaptive-mesh algorithms for computational fluid dynamics
NASA Technical Reports Server (NTRS)
Powell, Kenneth G.; Roe, Philip L.; Quirk, James
1993-01-01
The basic goal of adaptive-mesh algorithms is to distribute computational resources wisely by increasing the resolution of 'important' regions of the flow and decreasing the resolution of regions that are less important. While this goal is one that is worthwhile, implementing schemes that have this degree of sophistication remains more of an art than a science. In this paper, the basic pieces of adaptive-mesh algorithms are described and some of the possible ways to implement them are discussed and compared. These basic pieces are the data structure to be used, the generation of an initial mesh, the criterion to be used to adapt the mesh to the solution, and the flow-solver algorithm on the resulting mesh. Each of these is discussed, with particular emphasis on methods suitable for the computation of compressible flows.
Computations and algorithms in physical and biological problems
NASA Astrophysics Data System (ADS)
Qin, Yu
This dissertation presents the applications of state-of-the-art computation techniques and data analysis algorithms in three physical and biological problems: assembling DNA pieces, optimizing self-assembly yield, and identifying correlations from large multivariate datasets. In the first topic, in-depth analysis of using Sequencing by Hybridization (SBH) to reconstruct target DNA sequences shows that a modified reconstruction algorithm can overcome the theoretical boundary without the need for different types of biochemical assays and is robust to error. In the second topic, consistent with theoretical predictions, simulations using Graphics Processing Unit (GPU) demonstrate how controlling the short-ranged interactions between particles and controlling the concentrations optimize the self-assembly yield of a desired structure, and nonequilibrium behavior when optimizing concentrations is also unveiled by leveraging the computation capacity of GPUs. In the last topic, a methodology to incorporate existing categorization information into the search process to efficiently reconstruct the optimal true correlation matrix for multivariate datasets is introduced. Simulations on both synthetic and real financial datasets show that the algorithm is able to detect signals below the Random Matrix Theory (RMT) threshold. These three problems are representatives of using massive computation techniques and data analysis algorithms to tackle optimization problems, and outperform theoretical boundary when incorporating prior information into the computation.
Plagiarism Detection Algorithm for Source Code in Computer Science Education
ERIC Educational Resources Information Center
Liu, Xin; Xu, Chan; Ouyang, Boyu
2015-01-01
Nowadays, computer programming is getting more necessary in the course of program design in college education. However, the trick of plagiarizing plus a little modification exists among some students' home works. It's not easy for teachers to judge if there's plagiarizing in source code or not. Traditional detection algorithms cannot fit this…
Algorithms, Computation and Mathematics (Fortran Supplement). Teacher's Commentary. Revised Edition.
ERIC Educational Resources Information Center
Charp, Sylvia; And Others
This is the teacher's guide and commentary for the SMSG textbook Algorithms, Computation, and Mathematics (Fortran Supplement). The teacher's commentary provides background information for the teacher, suggestions for activities found in the Fortran Supplement, and answers for exercises and activities. The course is designed for high school…
Optimization of computer-generated binary holograms using genetic algorithms
NASA Astrophysics Data System (ADS)
Cojoc, Dan; Alexandrescu, Adrian
1999-11-01
The aim of this paper is to compare genetic algorithms against direct point oriented coding in the design of binary phase Fourier holograms, computer generated. These are used as fan-out elements for free space optical interconnection. Genetic algorithms are optimization methods which model the natural process of genetic evolution. The configuration of the hologram is encoded to form a chromosome. To start the optimization, a population of different chromosomes randomly generated is considered. The chromosomes compete, mate and mutate until the best chromosome is obtained according to a cost function. After explaining the operators that are used by genetic algorithms, this paper presents two examples with 32 X 32 genes in a chromosome. The crossover type and the number of mutations are shown to be important factors which influence the convergence of the algorithm. GA is demonstrated to be a useful tool to design namely binary phase holograms of complicate structures.
Survivable algorithms and redundancy management in NASA's distributed computing systems
NASA Technical Reports Server (NTRS)
Malek, Miroslaw
1992-01-01
The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.
NASA Technical Reports Server (NTRS)
Matthews, Bryan L.; Srivastava, Ashok N.
2010-01-01
Prior to the launch of STS-119 NASA had completed a study of an issue in the flow control valve (FCV) in the Main Propulsion System of the Space Shuttle using an adaptive learning method known as Virtual Sensors. Virtual Sensors are a class of algorithms that estimate the value of a time series given other potentially nonlinearly correlated sensor readings. In the case presented here, the Virtual Sensors algorithm is based on an ensemble learning approach and takes sensor readings and control signals as input to estimate the pressure in a subsystem of the Main Propulsion System. Our results indicate that this method can detect faults in the FCV at the time when they occur. We use the standard deviation of the predictions of the ensemble as a measure of uncertainty in the estimate. This uncertainty estimate was crucial to understanding the nature and magnitude of transient characteristics during startup of the engine. This paper overviews the Virtual Sensors algorithm and discusses results on a comprehensive set of Shuttle missions and also discusses the architecture necessary for deploying such algorithms in a real-time, closed-loop system or a human-in-the-loop monitoring system. These results were presented at a Flight Readiness Review of the Space Shuttle in early 2009.
Paramagnetic materials and practical algorithmic cooling for NMR quantum computing
NASA Astrophysics Data System (ADS)
Fernandez, Jose M.; Mor, Tal; Weinstein, Yossi
2003-08-01
Algorithmic cooling is a method devised by Boykin, Mor Rowchodhury, Vatan and Vrijen (PNAS Mar '02) for initializing NMR systems in general and NMR quantum computers in particular. The algorithm recursively employs two steps. The first is an adiabatic entropy compression of the computation qubits of the system. The second step is an isothermal heat transfer from the system to the environment through a set of reset qubits that reach thermal relaxation rapidly. To allow experimental algorithmic cooling, the thermalization time of the reset qubits must be much shorter than the thermalization time of the computation qubits. We investigated the effect of the paramagnetic material Chromium Acetylacetonate on the thermalization times of computation qubits (carbons) and reset qubit (hydrogen). We report here the accomplishment of an improved ratio of the thermalization times from T1(H)/T1(C) of approximately 5 to around 15. The magnetic ions from the Chromium Acetylacetonate interact with the reset qubit reducing their thermalization time, while their effect on the less exposed computation qubits is found to be weaker. An experimental demonstrating of non adiabatic cooling by thermalization and magnetic ion is currently performed by our group based on these results.
Architectural and Algorithmic Requirements for a Next-Generation System Analysis Code
V.A. Mousseau
2010-05-01
This document presents high-level architectural and system requirements for a next-generation system analysis code (NGSAC) to support reactor safety decision-making by plant operators and others, especially in the context of light water reactor plant life extension. The capabilities of NGSAC will be different from those of current-generation codes, not only because computers have evolved significantly in the generations since the current paradigm was first implemented, but because the decision-making processes that need the support of next-generation codes are very different from the decision-making processes that drove the licensing and design of the current fleet of commercial nuclear power reactors. The implications of these newer decision-making processes for NGSAC requirements are discussed, and resulting top-level goals for the NGSAC are formulated. From these goals, the general architectural and system requirements for the NGSAC are derived.
An Efficient VLSI Architecture of the Enhanced Three Step Search Algorithm
NASA Astrophysics Data System (ADS)
Biswas, Baishik; Mukherjee, Rohan; Saha, Priyabrata; Chakrabarti, Indrajit
2015-02-01
The intense computational complexity of any video codec is largely due to the motion estimation unit. The Enhanced Three Step Search is a popular technique that can be adopted for fast motion estimation. This paper proposes a novel VLSI architecture for the implementation of the Enhanced Three Step Search Technique. A new addressing mechanism has been introduced which enhances the speed of operation and reduces the area requirements. The proposed architecture when implemented in Verilog HDL on Virtex-5 Technology and synthesized using Xilinx ISE Design Suite 14.1 achieves a critical path delay of 4.8 ns while the area comes out to be 2.9K gate equivalent. It can be incorporated in commercial devices like smart-phones, camcorders, video conferencing systems etc.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
An algorithm for computing the distance spectrum of trellis codes
NASA Technical Reports Server (NTRS)
Rouanne, Marc; Costello, Daniel J., Jr.
1989-01-01
A class of quasiregular codes is defined for which the distance spectrum can be calculated from the codeword corresponding to the all-zero information sequence. Convolutional codes and regular codes are both quasiregular, as well as most of the best known trellis codes. An algorithm to compute the distance spectrum of linear, regular, and quasiregular trellis codes is presented. In particular, it can calculate the weight spectrum of convolutional (linear trellis) codes and the distance spectrum of most of the best known trellis codes. The codes do not have to be linear or regular, and the signals do not have to be used with equal probabilities. The algorithm is derived from a bidirectional stack algorithm, although it could also be based on the Viterbi algorithm. The algorithm is used to calculate the beginning of the distance spectrum of some of the best known trellis codes and to compute tight estimates on the first-event-error probability and on the bit-error probability.
NASA Astrophysics Data System (ADS)
Kim, Ji-hyun; Aum, Jaehong; Han, Jae-Ho; Jeong, Jichai
2015-01-01
We propose an optimized signal processing scheme that utilizes the compute unified device architecture (CUDA) for real-time spectral domain optical coherence tomography (OCT). Because linear spline interpolation and the direct spectral reshaping method have low data and control dependencies, these algorithms maximally utilize graphic processing unit (GPU) resources for dispersion control. In addition, data transfer between main memory and GPU, regarded as one of the most wasteful and time-consuming processes in GPU computing, is executed in parallel with the signal processing by overlapping kernel execution and data transfers. Experimental results obtained from application of the proposed scheme to a laboratory constructed OCT system comprising five spectrally shifted SLDs indicate that the OCT system has an axial resolution of 4.8 μm and transverse resolution of 13 μm in air. Further, coherence artifacts are reduced by 3-14 dB over the side-lobes in the point spread function. The optimization of CUDA enables OCT imaging rates up to 350 kHz (A-lines/sec) with a single GTX680 GPU.
The algorithmic level is the bridge between computation and brain.
Love, Bradley C
2015-04-01
Every scientist chooses a preferred level of analysis and this choice shapes the research program, even determining what counts as evidence. This contribution revisits Marr's (1982) three levels of analysis (implementation, algorithmic, and computational) and evaluates the prospect of making progress at each individual level. After reviewing limitations of theorizing within a level, two strategies for integration across levels are considered. One is top-down in that it attempts to build a bridge from the computational to algorithmic level. Limitations of this approach include insufficient theoretical constraint at the computation level to provide a foundation for integration, and that people are suboptimal for reasons other than capacity limitations. Instead, an inside-out approach is forwarded in which all three levels of analysis are integrated via the algorithmic level. This approach maximally leverages mutual data constraints at all levels. For example, algorithmic models can be used to interpret brain imaging data, and brain imaging data can be used to select among competing models. Examples of this approach to integration are provided. This merging of levels raises questions about the relevance of Marr's tripartite view. PMID:25823496
Ehsan, Shoaib; Clark, Adrian F; Naveed ur Rehman; McDonald-Maier, Klaus D
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
Reliable ISR algorithms for a very-low-power approximate computer
NASA Astrophysics Data System (ADS)
Eaton, Ross S.; McBride, Jonah C.; Bates, Joseph
2013-05-01
The Office of Naval Research (ONR) is looking for methods to perform higher levels of sensor processing onboard UAVs to alleviate the need to transmit full motion video to ground stations over constrained data links. Charles River Analytics is particularly interested in performing intelligence, surveillance, and reconnaissance (ISR) tasks using UAV sensor feeds. Computing with approximate arithmetic can provide 10,000x improvement in size, weight, and power (SWAP) over desktop CPUs, thereby enabling ISR processing onboard small UAVs. Charles River and Singular Computing are teaming on an ONR program to develop these low-SWAP ISR capabilities using a small, low power, single chip machine, developed by Singular Computing, with many thousands of cores. Producing reliable results efficiently on massively parallel approximate machines requires adapting the core kernels of algorithms. We describe a feature-aided tracking algorithm adapted for the novel hardware architecture, which will be suitable for use onboard a UAV. Tests have shown the algorithm produces results equivalent to state-of-the-art traditional approaches while achieving a 6400x improvement in speed/power ratio.
Ehsan, Shoaib; Clark, Adrian F.; ur Rehman, Naveed; McDonald-Maier, Klaus D.
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
Computational Discovery of Materials Using the Firefly Algorithm
NASA Astrophysics Data System (ADS)
Avendaño-Franco, Guillermo; Romero, Aldo
Our current ability to model physical phenomena accurately, the increase computational power and better algorithms are the driving forces behind the computational discovery and design of novel materials, allowing for virtual characterization before their realization in the laboratory. We present the implementation of a novel firefly algorithm, a population-based algorithm for global optimization for searching the structure/composition space. This novel computation-intensive approach naturally take advantage of concurrency, targeted exploration and still keeping enough diversity. We apply the new method in both periodic and non-periodic structures and we present the implementation challenges and solutions to improve efficiency. The implementation makes use of computational materials databases and network analysis to optimize the search and get insights about the geometric structure of local minima on the energy landscape. The method has been implemented in our software PyChemia, an open-source package for materials discovery. We acknowledge the support of DMREF-NSF 1434897 and the Donors of the American Chemical Society Petroleum Research Fund for partial support of this research under Contract 54075-ND10.
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
Neylon, J. Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.
2014-10-15
Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria
ERIC Educational Resources Information Center
Amenyo, John-Thones
2012-01-01
Carefully engineered playable games can serve as vehicles for students and practitioners to learn and explore the programming of advanced computer architectures to execute applications, such as high performance computing (HPC) and complex, inter-networked, distributed systems. The article presents families of playable games that are grounded in…
An efficient FPGA architecture for integer ƞth root computation
NASA Astrophysics Data System (ADS)
Rangel-Valdez, Nelson; Barron-Zambrano, Jose Hugo; Torres-Huitzil, Cesar; Torres-Jimenez, Jose
2015-10-01
In embedded computing, it is common to find applications such as signal processing, image processing, computer graphics or data compression that might benefit from hardware implementation for the computation of integer roots of order ?. However, the scientific literature lacks architectural designs that implement such operations for different values of N, using a low amount of resources. This article presents a parameterisable field programmable gate array (FPGA) architecture for an efficient Nth root calculator that uses only adders/subtractors and ? location memory elements. The architecture was tested for different values of ?, using 64-bit number representation. The results show a consumption up to 10% of the logical resources of a Xilinx XC6SLX45-CSG324C device, depending on the value of N. The hardware implementation improved the performance of its corresponding software implementations in one order of magnitude. The architecture performance varies from several thousands to seven millions of root operations per second.
State-Estimation Algorithm Based on Computer Vision
NASA Technical Reports Server (NTRS)
Bayard, David; Brugarolas, Paul
2007-01-01
An algorithm and software to implement the algorithm are being developed as means to estimate the state (that is, the position and velocity) of an autonomous vehicle, relative to a visible nearby target object, to provide guidance for maneuvering the vehicle. In the original intended application, the autonomous vehicle would be a spacecraft and the nearby object would be a small astronomical body (typically, a comet or asteroid) to be explored by the spacecraft. The algorithm could also be used on Earth in analogous applications -- for example, for guiding underwater robots near such objects of interest as sunken ships, mineral deposits, or submerged mines. It is assumed that the robot would be equipped with a vision system that would include one or more electronic cameras, image-digitizing circuitry, and an imagedata- processing computer that would generate feature-recognition data products.
Memory intensive functional architecture for distributed computer control systems
Dimmler, D.G.
1983-10-01
A memory-intensive functional architectue for distributed data-acquisition, monitoring, and control systems with large numbers of nodes has been conceptually developed and applied in several large-scale and some smaller systems. This discussion concentrates on: (1) the basic architecture; (2) recent expansions of the architecture which now become feasible in view of the rapidly developing component technologies in microprocessors and functional large-scale integration circuits; and (3) implementation of some key hardware and software structures and one system implementation which is a system for performing control and data acquisition of a neutron spectrometer at the Brookhaven High Flux Beam Reactor. The spectrometer is equipped with a large-area position-sensitive neutron detector.
Architectural issues in fault-tolerant, secure computing systems
Joseph, M.K.
1988-01-01
This dissertation explores several facets of the applicability of fault-tolerance techniques to secure computer design, these being: (1) how fault-tolerance techniques can be used on unsolved problems in computer security (e.g., computer viruses, and denial-of-service); (2) how fault-tolerance techniques can be used to support classical computer-security mechanisms in the presence of accidental and deliberate faults; and (3) the problems involved in designing a fault-tolerant, secure computer system (e.g., how computer security can degrade along with both the computational and fault-tolerance capabilities of a computer system). The approach taken in this research is almost as important as its results. It is different from current computer-security research in that a design paradigm for fault-tolerant computer design is used. This led to an extensive fault and error classification of many typical security threats. Throughout this work, a fault-tolerance perspective is taken. However, the author did not ignore basic computer-security technology. For some problems he investigated how to support and extend basic-security mechanism (e.g., trusted computing base), instead of trying to achieve the same result with purely fault-tolerance techniques.
High pressure humidification columns: Design equations, algorithm, and computer code
Enick, R.M.; Klara, S.M.; Marano, J.J.
1994-07-01
This report describes the detailed development of a computer model to simulate the humidification of an air stream in contact with a water stream in a countercurrent, packed tower, humidification column. The computer model has been developed as a user model for the Advanced System for Process Engineering (ASPEN) simulator. This was done to utilize the powerful ASPEN flash algorithms as well as to provide ease of use when using ASPEN to model systems containing humidification columns. The model can easily be modified for stand-alone use by incorporating any standard algorithm for performing flash calculations. The model was primarily developed to analyze Humid Air Turbine (HAT) power cycles; however, it can be used for any application that involves a humidifier or saturator. The solution is based on a multiple stage model of a packed column which incorporates mass and energy, balances, mass transfer and heat transfer rate expressions, the Lewis relation and a thermodynamic equilibrium model for the air-water system. The inlet air properties, inlet water properties and a measure of the mass transfer and heat transfer which occur in the column are the only required input parameters to the model. Several example problems are provided to illustrate the algorithm`s ability to generate the temperature of the water, flow rate of the water, temperature of the air, flow rate of the air and humidity of the air as a function of height in the column. The algorithm can be used to model any high-pressure air humidification column operating at pressures up to 50 atm. This discussion includes descriptions of various humidification processes, detailed derivations of the relevant expressions, and methods of incorporating these equations into a computer model for a humidification column.
Computer architecture providing high-performance and low-cost solutions for fast fMRI reconstruction
NASA Astrophysics Data System (ADS)
Chao, Hui; Goddard, J. Iain
1998-07-01
Due to the dynamic nature of brain studies in functional magnetic resonance imaging (fMRI), fast pulse sequences such as echo planar imaging (EPI) and spiral are often used for higher temporal resolution. Hundreds of frames of two- dimensional (2-D) images or multiple three-dimensional (3-D) images are often acquired to cover a larger space and time range. Therefore, fMRI often requires a much larger data storage, faster data transfer rate and higher processing power than conventional MRI. In Mercury Computer Systems' PCI-based embedded computer system, the computer architecture allows the concurrent use of a DMA engine for data transfer and CPU for data processing. This architecture allows a multicomputer to distribute processing and data with minimal time spent transferring data. Different types and numbers of processors are available to optimize system performance for the application. The fMRI reconstruction was first implemented in Mercury's PCI-based embedded computer system by using one digital signal processing (DSP) chip, with the host computer running under the Windows NTR platform. Double buffers in SRAM or cache were created for concurrent I/O and processing. The fMRI reconstruction was then implemented in parallel using multiple DSP chips. Data transfer and interprocessor synchronization were carefully managed to optimize algorithm efficiency. The image reconstruction times were measured with different numbers of processors ranging from one to 10. With one DSP chip, the timing for reconstructing 100 fMRI images measuring 128 X 64 pixels was 1.24 seconds, which is already faster than most existing commercial MRI systems. This PCI-based embedded multicomputer architecture, which has a nearly linear improvement in performance, provides high performance for fMRI processing. In summary, this embedded multicomputer system allows the choice of computer topologies to fit the specific application to achieve maximum system performance.
Computation of synthetic mammograms with an edge-weighting algorithm
NASA Astrophysics Data System (ADS)
Homann, Hanno; Bergner, Frank; Erhard, Klaus
2015-03-01
The promising increase in cancer detection rates1, 2 makes digital breast tomosynthesis (DBT) an interesting alternative to full-field digital mammography (FFDM) in breast cancer screening. However, this benefit comes at the cost of an increased average glandular dose in a combined DBT plus FFDM acquisition protocol. Synthetic mammograms, which are computed from the reconstructed tomosynthesis volume data, have demonstrated to be an alternative to a regular FFDM exposure in a DBT plus synthetic 2D reading mode.3 Besides weighted averaging and modified maximum intensity projection (MIP) methods,4, 5 the integration of CAD techniques for computing a weighting function in the forward projection step of the synthetic mammogram generation has been recently proposed.6, 7 In this work, a novel and computationally efficient method is presented based on an edge-retaining algorithm, which directly computes the weighting function by an edge-detection filter.
Reference Architecture for High Dependability On-Board Computers
NASA Astrophysics Data System (ADS)
Silva, Nuno; Esper, Alexandre; Zandin, Johan; Barbosa, Ricardo; Monteleone, Claudio
2014-08-01
The industrial process in the area of on-board computers is characterized by small production series of on-board computers (hardware and software) configuration items with little recurrence at unit or set level (e.g. computer equipment unit, set of interconnected redundant units). These small production series result into a reduced amount of statistical data related to dependability, which influence on the way on-board computers are specified, designed and verified. In the context of ESA harmonization policy for the deployment of enhanced and homogeneous industrial processes in the area of avionics embedded systems and on-board computers for the space industry, this study aimed at rationalizing the initiation phase of the development or procurement of on-board computers and at improving dependability assurance. This aim was achieved by establishing generic requirements for the procurement or development of on-board computers with a focus on well-defined reliability, availability, and maintainability requirements, as well as a generic methodology for planning, predicting and assessing the dependability of on-board computers hardware and software throughout their life cycle. It also provides guidelines for producing evidence material and arguments to support dependability assurance of on-board computers hardware and software throughout the complete lifecycle, including an assessment of feasibility aspects of the dependability assurance process and how the use of computer-aided environment can contribute to the on-board computer dependability assurance.
Algorithmic and Experimental Computation of Higher-Order Safe Primes
NASA Astrophysics Data System (ADS)
Díaz, R. Durán; Masqué, J. Muñoz
2008-09-01
This paper deals with a class of special primes called safe primes. In the regular definition, an odd prime p is safe if, at least, one of (p±1)/2 is prime. Safe primes have been recommended as factors of RSA moduli. In this paper, the concept of safe primes is extended to higher-order safe primes, and an explicit formula to compute the density of this class of primes in the set of the integers is supplied. Finally, explicit conditions are provided permitting the algorithmic computation of safe primes of arbitrary order. Some experimental results are provided as well.
Sort-Mid tasks scheduling algorithm in grid computing.
Reda, Naglaa M; Tawfik, A; Marzok, Mohamed A; Khamis, Soheir M
2015-11-01
Scheduling tasks on heterogeneous resources distributed over a grid computing system is an NP-complete problem. The main aim for several researchers is to develop variant scheduling algorithms for achieving optimality, and they have shown a good performance for tasks scheduling regarding resources selection. However, using of the full power of resources is still a challenge. In this paper, a new heuristic algorithm called Sort-Mid is proposed. It aims to maximizing the utilization and minimizing the makespan. The new strategy of Sort-Mid algorithm is to find appropriate resources. The base step is to get the average value via sorting list of completion time of each task. Then, the maximum average is obtained. Finally, the task has the maximum average is allocated to the machine that has the minimum completion time. The allocated task is deleted and then, these steps are repeated until all tasks are allocated. Experimental tests show that the proposed algorithm outperforms almost other algorithms in terms of resources utilization and makespan. PMID:26644937
A Moving Target Environment for Computer Configurations Using Genetic Algorithms
Crouse, Michael; Fulp, Errin W.
2011-10-31
Moving Target (MT) environments for computer systems provide security through diversity by changing various system properties that are explicitly defined in the computer configuration. Temporal diversity can be achieved by making periodic configuration changes; however in an infrastructure of multiple similarly purposed computers diversity must also be spatial, ensuring multiple computers do not simultaneously share the same configuration and potential vulnerabilities. Given the number of possible changes and their potential interdependencies discovering computer configurations that are secure, functional, and diverse is challenging. This paper describes how a Genetic Algorithm (GA) can be employed to find temporally and spatially diverse secure computer configurations. In the proposed approach a computer configuration is modeled as a chromosome, where an individual configuration setting is a trait or allele. The GA operates by combining multiple chromosomes (configurations) which are tested for feasibility and ranked based on performance which will be measured as resistance to attack. The result of successive iterations of the GA are secure configurations that are diverse due to the crossover and mutation processes. Simulations results will demonstrate this approach can provide at MT environment for a large infrastructure of similarly purposed computers by discovering temporally and spatially diverse secure configurations.
Architectures and algorithms for all-optical 3D signal processing
NASA Astrophysics Data System (ADS)
Giglmayr, Josef
1999-07-01
All-optical signal processing by >= 2D lightwave circuits (LCs) is (i) aimed to allow the (later) inclusion of the frequency domain and is (ii) subject to photonic integration and thus the architectural and algorithmic framework has to be prepared carefully. Much work has been done in >= 2D algebraic system theory/modern control theory which has been applied in the electronic field of signal and image processing. For the application to modeling, analysis and design of the proposed 3D lightwave circuits (LCs) some elements are needed to describe and evalute the system efficiency as the number of system states of 3D LCs increases dramatically with regard to the number of i/o. Several problems, arising throughput such an attempt, are made transparent and solutions are proposed.
Self-reconfigurable approach for computation-intensive motion estimation algorithm in H.264/AVC
NASA Astrophysics Data System (ADS)
Lee, Jooheung; Ryu, Chul; Kim, Soontae
2012-04-01
The authors propose a self-reconfigurable approach to perform H.264/AVC variable block size motion estimation computation on field-programmable gate arrays. We use dynamic partial reconfiguration to change the hardware architecture of motion estimation during run-time. Hardware adaptation to meet the real-time computing requirements for the given video resolutions and frame rates is performed through self-reconfiguration. An embedded processor is used to control the reconfiguration of partial bitstreams of motion estimation adaptively. The partial bitstreams for different motion estimation computation arrays are compressed using LZSS algorithm. On-chip BlockRAM is used as a cache to pre-store the partial bitstreams so that run-time reconfiguration can be fully utilized. We designed a hardware module to fetch the pre-stored partial bitstream from BlockRAM to an internal configuration access port. Comparison results show that our motion estimation architecture improves toward data reuse, and the memory bandwidth overhead is reduced. Using our self-reconfigurable platform, the reconfiguration overhead can be removed and 367 MB/sec reconfiguration rate can be achieved. The experimental results show that the external memory accesses are reduced by 62.4% and it can operate at a frequency of 91.7 MHz.
On-Board Computing Subsystem for MIRAX: Architectural and Interface Aspects
Santiago, Valdivino
2006-06-09
This paper presents some proposals of architecture and interfaces among the different types of processing units of MIRAX on-board computing subsystem. MIRAX satellite payload is composed of dedicated computers, two Hard X-Ray cameras and one Soft X-Ray camera (WFC flight spare unit from BeppoSAX satellite). The architectures for the On-Board Computing Subsystem will take into account hardware or software solution of the event preprocessing for CdZnTe detectors. Hardware and software interfaces approaches will be shown and also requirements of on-board memory storage and telemetry will be addressed.
Architectural concepts and redundancy techniques in fault-tolerant computers
NASA Technical Reports Server (NTRS)
Rennels, D. A.
1974-01-01
This paper presents a description of redundancy techniques employed in the design of fault-tolerant computers, and a discussion of the effects of functional requirements, technology constraints, and cost considerations which enter into the choice of these techniques. The STAR computer, developed at the Jet Propulsion Laboratory for long-duration planetary spacecraft missions, is discussed along with several later fault-tolerant computer designs. The class of computers described in this paper employs dynamic redundancy, i.e., the machine is divided into a set of submodules, each with standby spares; a special hard core monitor unit detects and diagnoses faults, and effects automated recovery by replacing failed parts.
Opportunities for X-ray Science in Future Computing Architectures
Foster, Ian
2011-02-09
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no sign of slowing down. The next 10 years will surely see exascale, new cloud offerings, and other terabit networks. This talk reviews various of these developments and discusses their potential implications for x-ray science and x-ray facilities.
Spin-qubit inspired architectures for superconducting quantum computing
NASA Astrophysics Data System (ADS)
Shim, Yun-Pil; Tahan, Charles
2015-03-01
In recent years, the superconducting qubit community has achieved single and two-qubit benchmarked gate fidelities approaching 99.9%, fast readout with novel superconducting amplifiers, distributed entanglement, and other milestones on the road to fault-tolerant quantum information processing. Obviously, this is a field that could use some help from the semiconductor qubit community! Here we present theoretical work on superconducting qubit systems inspired by our experience with semiconductor qubits. We discuss initialization, single- and two-qubit gate operations, and measurement schemes for an encoded qubit in a two-dimensional architecture. Our results motivate new ways of designing or operating superconducting quantum information processors.
ERIC Educational Resources Information Center
Nikolic, B.; Radivojevic, Z.; Djordjevic, J.; Milutinovic, V.
2009-01-01
Courses in Computer Architecture and Organization are regularly included in Computer Engineering curricula. These courses are usually organized in such a way that students obtain not only a purely theoretical experience, but also a practical understanding of the topics lectured. This practical work is usually done in a laboratory using simulators…
ERIC Educational Resources Information Center
Stanley, Timothy D.; Wong, Lap Kei; Prigmore, Daniel; Benson, Justin; Fishler, Nathan; Fife, Leslie; Colton, Don
2007-01-01
Students learn better when they both hear and do. In computer architecture courses "doing" can be difficult in small schools without hardware laboratories hosted by computer engineering, electrical engineering, or similar departments. Software solutions exist. Our success with George Mills' Multimedia Logic (MML) is the focus of this paper. MML…
A Project-Based Learning Approach to Programmable Logic Design and Computer Architecture
ERIC Educational Resources Information Center
Kellett, C. M.
2012-01-01
This paper describes a course in programmable logic design and computer architecture as it is taught at the University of Newcastle, Australia. The course is designed around a major design project and has two supplemental assessment tasks that are also described. The context of the Computer Engineering degree program within which the course is…
Formiconi, A R; Passeri, A; Guelfi, M R; Masoni, M; Pupi, A; Meldolesi, U; Malfetti, P; Calori, L; Guidazzoli, A
1997-11-01
Data from Single Photon Emission Computed Tomography (SPECT) studies are blurred by inevitable physical phenomena occurring during data acquisition. These errors may be compensated by means of reconstruction algorithms which take into account accurate physical models of the data acquisition procedure. Unfortunately, this approach involves high memory requirements as well as a high computational burden which cannot be afforded by the computer systems of SPECT acquisition devices. In this work the possibility of accessing High Performance Computing and Networking (HPCN) resources through a World Wide Web interface for the advanced reconstruction of SPECT data in a clinical environment was investigated. An iterative algorithm with an accurate model of the variable system response was ported on the Multiple Instruction Multiple Data (MIMD) parallel architecture of a Cray T3D massively parallel computer. The system was accessible even from low cost PC-based workstations through standard TCP/IP networking. A speedup factor of 148 was predicted by the benchmarks run on the Cray T3D. A complete brain study of 30 (64 x 64) slices was reconstructed from a set of 90 (64 x 64) projections with ten iterations of the conjugate gradients algorithm in 9 s which corresponds to an actual speed-up factor of 135. The technique was extended to a more accurate 3D modeling of the system response for a true 3D reconstruction of SPECT data; the reconstruction time of the same data set with this more accurate model was 5 min. This work demonstrates the possibility of exploiting remote HPCN resources from hospital sites by means of low cost workstations using standard communication protocols and an user-friendly WWW interface without particular problems for routine use. PMID:9506406
NASA Technical Reports Server (NTRS)
Duff, Michael J. B. (Editor); Siegel, Howard J. (Editor); Corbett, Francis J. (Editor)
1986-01-01
The conference presents papers on the architectures, algorithms, and applications of image processing. Particular attention is given to a very large scale integration system for image reconstruction from projections, a prebuffer algorithm for instant display of volume data, and an adaptive image sequence filtering scheme based on motion detection. Papers are also presented on a simple, direct practical method of sensing local motion and analyzing local optical flow, image matching techniques, and an automated biological dosimetry system.
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
NASA Astrophysics Data System (ADS)
Gurcan, Metin N.; Chan, Heang-Ping; Sahiner, Berkman; Hadjiiski, Lubomir M.; Petrick, Nicholas; Helvie, Mark A.
2002-05-01
We evaluated the effectiveness of an optimal convolution neural network (CNN) architecture selected by simulated annealing for improving the performance of a computer-aided diagnosis (CAD) system designed for the detection of microcalcification clusters on digitized mammograms. The performances of the CAD programs with manually and optimally selected CNNs were compared using an independent test set. This set included 472 mammograms and contained 253 biopsy-proven malignant clusters. Free-response receiver operating characteristic (FROC) analysis was used for evaluation of the detection accuracy. At a false positive (FP) rate of 0.7 per image, the film-based sensitivity was 84.6% with the optimized CNN, in comparison with 77.2% with the manually selected CNN. If clusters having images in both craniocaudal and mediolateral oblique views were analyzed together and a cluster was considered to be detected when it was detected in one or both views, at 0.7 FPs/image, the sensitivity was 93.3% with the optimized CNN and 87.0% with the manually selected CNN. This study indicates that classification of true positive and FP signals is an important step of the CAD program and that the detection accuracy of the program can be considerably improved by optimizing this step with an automated optimization algorithm.
A learnable parallel processing architecture towards unity of memory and computing
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-01-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-01-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243
Algorithms for computing and integrating physical maps using unique probes.
Jain, M; Myers, E W
1997-01-01
Current physical mapping projects based on STS-probes involve additional clues such as the fact that some probes are anchored to a known map and that others come from the ends of clones. Because of the disparate combinatorial contributions of these varied data items, it is difficult to design a "tailored" algorithm that incorporates them all. Moreover, it is inevitable that new experiments will provide new kinds of data, making obsolete any such algorithm. We show how to convert the physical mapping problem into a 0/1 linear programming (LP) problem. We further show how one can incorporate additional clues as additional constraints in the LP formulation. We give a simple relaxation of the 0/1 LP problem, which solves problems of the same scale as previously reported tailored algorithms, to equal or greater optimization levels. We also present a theorem proving that when the data is 100% accurate, then the relaxed and integer solutions coincide. The LP algorithm suffices to solve problems on the order of 80-100 probes--the typical size of the 2- or 3-connected contigs of Arratia et al. (1991). We give a heuristic algorithm which attempts to order and link the set of LP-solved contigs. Unlike previous work, this algorithm only links and orders contigs when the join is 90% or more likely to be correct. It is our view that there is no value in computing an optimal solution with respect to some criteria over very noisy data as this optimal solution rarely corresponds to the true solution. The paper involves extensive empirical trials over real and simulated data. PMID:9385539
NASA Technical Reports Server (NTRS)
Saltz, J. H.; Naik, V. K.; Nicol, D. M.
1986-01-01
The efficient implementation of algorithms on multiprocessor machines requires that the effects of communication delays be minimized. The effects of these delays on the performance of a model problem on a hypercube multiprocessor architecture is investigated and methods are developed for increasing algorithm efficiency. The model problem under investigation is the solution by red-black Successive Over Relaxation YOUN71 of the heat equation; most of the techniques described here also apply equally well to the solution of elliptic partial differential equations by red-black or multicolor SOR methods. Methods for reducing communication traffic and overhead on a multiprocessor are identified and results of testing these methods on the Intel iPSC Hypercube reported. Methods for partitioning a problem's domain across processors, for reducing communication traffic during a global convergence check, for reducing the number of global convergence checks employed during an iteration, and for concurrently iterating on multiple time-steps in a time-dependent problem. Empirical results show that use of these models can markedly reduce a numewrical problem's execution time.
PACCE: Perl Algorithm to Compute Continuum and Equivalent Widths
NASA Astrophysics Data System (ADS)
Riffel, Rogério; Borges Vale, Tibério
2011-05-01
We present Perl Algorithm to Compute continuum and Equivalent Widths (pacce). We describe the methods used in the computations and the requirements for its usage. We compare the measurements made with pacce and "manual" ones made using iraf splot task. These tests show that for SSP models the equivalent widths strengths are very similar (differences <0.2A) for both measurements. In real stellar spectra, the correlation between both values is still very good, but with differences of up to 0.5A. pacce is also able to determine mean continuum and continuum at line center values, which are helpful in stellar population studies. In addition, it is also able to compute the uncertainties in the equivalent widths using photon statistics.
Distributed Computing Architecture for Image-Based Wavefront Sensing and 2 D FFTs
NASA Technical Reports Server (NTRS)
Smith, Jeffrey S.; Dean, Bruce H.; Haghani, Shadan
2006-01-01
Image-based wavefront sensing (WFS) provides significant advantages over interferometric-based wavefi-ont sensors such as optical design simplicity and stability. However, the image-based approach is computational intensive, and therefore, specialized high-performance computing architectures are required in applications utilizing the image-based approach. The development and testing of these high-performance computing architectures are essential to such missions as James Webb Space Telescope (JWST), Terrestial Planet Finder-Coronagraph (TPF-C and CorSpec), and Spherical Primary Optical Telescope (SPOT). The development of these specialized computing architectures require numerous two-dimensional Fourier Transforms, which necessitate an all-to-all communication when applied on a distributed computational architecture. Several solutions for distributed computing are presented with an emphasis on a 64 Node cluster of DSPs, multiple DSP FPGAs, and an application of low-diameter graph theory. Timing results and performance analysis will be presented. The solutions offered could be applied to other all-to-all communication and scientifically computationally complex problems.
Liu, Yinxiao; Jin, Dakai; Saha, Punam K.
2015-01-01
Adult bone diseases, especially osteoporosis, lead to increased risk of fracture associated with substantial morbidity, mortality, and financial costs. Clinically, osteoporosis is defined by low bone mineral density (BMD); however, increasing evidence suggests that the micro-architectural quality of trabecular bone (TB) is an important determinant of bone strength and fracture risk. Accurate measurement of trabecular thickness and marrow spacing is of significant interest for early diagnosis of osteoporosis or treatment effects. Here, we present a new robust algorithm for computing TB thickness and marrow spacing at a low resolution achievable in vivo. The method uses a star-line tracing technique that effectively deals with partial voluming effects of in vivo imaging where voxel size is comparable to TB thickness. Experimental results on cadaveric ankle specimens have demonstrated the algorithm’s robustness (ICC>0.98) under repeat scans of multi-row detector computed tomography (MD-CT) imaging. It has been observed in experimental results that TB thickness and marrow spacing measures as computed by the new algorithm have strong association (R2 ∈{0.85, 0.87}) with TB’s experimental mechanical strength measures. PMID:27330678
Arranging computer architectures to create higher-performance controllers
NASA Technical Reports Server (NTRS)
Jacklin, Stephen A.
1988-01-01
Techniques for integrating microprocessors, array processors, and other intelligent devices in control systems are reviewed, with an emphasis on the (re)arrangement of components to form distributed or parallel processing systems. Consideration is given to the selection of the host microprocessor, increasing the power and/or memory capacity of the host, multitasking software for the host, array processors to reduce computation time, the allocation of real-time and non-real-time events to different computer subsystems, intelligent devices to share the computational burden for real-time events, and intelligent interfaces to increase communication speeds. The case of a helicopter vibration-suppression and stabilization controller is analyzed as an example, and significant improvements in computation and throughput rates are demonstrated.
Computer-Aided Design of Organic Host Architectures for Selective Chemosensors
Hay, Benjamin; Bryantsev, Vyacheslav S.
2009-01-01
Selective organic hosts provide the foundation for the development of many types of sensors. The deliberate design of host molecules with predetermined selectivity, however, remains a challenge in supramolecular chemistry. To address this issue we have developed a de novo structure-based design approach for the unbiased construction of complementary host architectures. This chapter summarizes recent progress including improvements on a computer software program, HostDesigner, specifically tailored to discover host architectures for small guest molecules. HostDesigner is capable of generating and evaluating millions of candidate structures in minutes on a desktop personal computer, allowing a user to rapidly identify three-dimensional architectures that are structurally organized for binding a targeted guest species. The efficacy of this computational methodology is illustrated with a search for cation hosts containing aliphatic ether oxygen groups and anion hosts containing urea groups.
A Survey of Architectural Techniques for Near-Threshold Computing
Mittal, Sparsh
2015-12-28
Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. In low-voltage computing and specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate and performance loss etc. Our paper surveys several re- cent techniques which aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlightmore » their similarities and differences. Ultimately, we hope that this paper will provide insights into state-of-art NTC techniques to researchers and system-designers and inspire further research in this field.« less
A Survey of Architectural Techniques for Near-Threshold Computing
Mittal, Sparsh
2015-12-28
Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. In low-voltage computing and specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate and performance loss etc. Our paper surveys several re- cent techniques which aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlight their similarities and differences. Ultimately, we hope that this paper will provide insights into state-of-art NTC techniques to researchers and system-designers and inspire further research in this field.
Final Report: Super Instruction Architecture for Scalable Parallel Computations
Sanders, Beverly Ann; Bartlett, Rodney; Deumens, Erik
2013-12-23
The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.
A language comparison for scientific computing on MIMD architectures
NASA Technical Reports Server (NTRS)
Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.
1989-01-01
Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.
Initial condition for efficient mapping of level set algorithms on many-core architectures
NASA Astrophysics Data System (ADS)
Tornai, Gábor János; Cserey, György
2014-12-01
In this paper, we investigated the effect of adding more small curves to the initial condition which determines the required number of iterations of a fast level set (LS) evolution. As a result, we discovered two new theorems and developed a proof on the worst case of the required number of iterations. Furthermore, we found that these kinds of initial conditions fit well to many-core architectures. To show this, we have included two case studies which are presented on different platforms. One runs on a graphical processing unit (GPU) and the other is executed on a cellular nonlinear network universal machine (CNN-UM). With the new initial conditions, the steady-state solutions of the LS are reached in less than eight iterations depending on the granularity of the initial condition. These dense iterations can be calculated very quickly on many-core platforms according to the two case studies. In the case of the proposed dense initial condition on GPU, there is a significant speedup compared to the sparse initial condition in all cases since our dense initial condition together with the algorithm utilizes the properties of the underlying architecture. Therefore, greater performance gain can be achieved (up to 18 times speedup compared to the sparse initial condition on GPU). Additionally, we have validated our concept against numerically approximated LS evolution of standard flows (mean curvature, Chan-Vese, geodesic active regions). The dice indexes between the fast LS evolutions and the evolutions of the numerically approximated partial differential equations are in the range of 0.99±0.003.
An evaluation of computer assisted clinical classification algorithms.
Chute, C G; Yang, Y; Buntrock, J
1994-01-01
The Mayo Clinic has a long tradition of indexing patient records in high resolution and volume. Several algorithms have been developed which promise to help human coders in the classification process. We evaluate variations on code browsers and free text indexing systems with respect to their speed and error rates in our production environment. The more sophisticated indexing systems save measurable time in the coding process, but suffer from incompleteness which requires a back-up system or human verification. Expert Network does the best job of rank ordering clinical text, potentially enabling the creation of thresholds for the pass through of computer coded data without human review. PMID:7949912
A multitasking finite state architecture for computer control of an electric powertrain
Burba, J.C.
1984-01-01
Finite state techniques provide a common design language between the control engineer and the computer engineer for event driven computer control systems. They simplify communication and provide a highly maintainable control system understandable by both. This paper describes the development of a control system for an electric vehicle powertrain utilizing finite state concepts. The basics of finite state automata are provided as a framework to discuss a unique multitasking software architecture developed for this application. The architecture employs conventional time-sliced techniques with task scheduling controlled by a finite state machine representation of the control strategy of the powertrain. The complexities of excitation variable sampling in this environment are also considered.
A comparison of computer architectures for the NASA demonstration advanced avionics system
NASA Technical Reports Server (NTRS)
Seacord, C. L.; Bailey, D. G.; Larson, J. C.
1979-01-01
The paper compares computer architectures for the NASA demonstration advanced avionics system. Two computer architectures are described with an unusual approach to fault tolerance: a single spare processor can correct for faults in any of the distributed processors by taking on the role of a failed module. It was shown the system must be used from a functional point of view to properly apply redundancy and achieve fault tolerance and ultra reliability. Data are presented on complexity and mission failure probability which show that the revised version offers equivalent mission reliability at lower cost as measured by hardware and software complexity.
Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; Maris, Pieter; Vary, James P.
2015-07-14
In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important features of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.
Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; Maris, Pieter; Vary, James P.
2015-07-14
In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important featuresmore » of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.« less
An Architectural Design System Based on Computer Graphics.
ERIC Educational Resources Information Center
MacDonald, Stephen L.; Wehrli, Robert
The recent developments in computer hardware and software are presented to inform architects of this design tool. Technical advancements in equipment include--(1) cathode ray tube displays, (2) light pens, (3) print-out and photo copying attachments, (4) controls for comparison and selection of images, (5) chording keyboards, (6) plotters, and (7)…
COMPUTER ARCHITECTURE FOR RESEARCH IN METEOROLOGY AND ATMOSPHERIC CHEMISTRY
The study examines the feasibility of constructing a peripheral hardware module that could be attached to a mini or midsized computer to accelerate the execution of large air pollution models, such as the EPA's Regional Oxidant Model (ROM). Crucial information necessary to design...
CSP: A Multifaceted Hybrid Architecture for Space Computing
NASA Technical Reports Server (NTRS)
Rudolph, Dylan; Wilson, Christopher; Stewart, Jacob; Gauvin, Patrick; George, Alan; Lam, Herman; Crum, Gary Alex; Wirthlin, Mike; Wilson, Alex; Stoddard, Aaron
2014-01-01
Research on the CHREC Space Processor (CSP) takes a multifaceted hybrid approach to embedded space computing. Working closely with the NASA Goddard SpaceCube team, researchers at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida and Brigham Young University are developing hybrid space computers that feature an innovative combination of three technologies: commercial-off-the-shelf (COTS) devices, radiation-hardened (RadHard) devices, and fault-tolerant computing. Modern COTS processors provide the utmost in performance and energy-efficiency but are susceptible to ionizing radiation in space, whereas RadHard processors are virtually immune to this radiation but are more expensive, larger, less energy-efficient, and generations behind in speed and functionality. By featuring COTS devices to perform the critical data processing, supported by simpler RadHard devices that monitor and manage the COTS devices, and augmented with novel uses of fault-tolerant hardware, software, information, and networking within and between COTS devices, the resulting system can maximize performance and reliability while minimizing energy consumption and cost. NASA Goddard has adopted the CSP concept and technology with plans underway to feature flight-ready CSP boards on two upcoming space missions.
NASA Astrophysics Data System (ADS)
Kovalev, I. V.; Zelenkov, P. V.; Karaseva, M. V.; Tsarev, M. Yu; Tsarev, R. Yu
2015-01-01
The paper considers the problem of the analysis of distributed computer systems reliability with client-server architecture. A distributed computer system is a set of hardware and software for implementing the following main functions: processing, storage, transmission and data protection. This paper discusses the distributed computer systems architecture "client-server". The paper presents the scheme of the distributed computer system functioning represented as a graph where vertices are the functional state of the system and arcs are transitions from one state to another depending on the prevailing conditions. In reliability analysis we consider such reliability indicators as the probability of the system transition in the stopping state and accidents, as well as the intensity of these transitions. The proposed model allows us to obtain correlations for the reliability parameters of the distributed computer system without any assumptions about the distribution laws of random variables and the elements number in the system.
SAR data exploitation: computational technology enabling SAR ATR algorithm development
NASA Astrophysics Data System (ADS)
Majumder, Uttam K.; Casteel, Curtis H., Jr.; Buxa, Peter; Minardi, Michael J.; Zelnio, Edmund G.; Nehrbass, John W.
2007-04-01
A fundamental issue with synthetic aperture radar (SAR) application development is data processing and exploitation in real-time or near real-time. The power of high performance computing (HPC) clusters, FPGA, and the IBM Cell processor presents new algorithm development possibilities that have not been fully leveraged. In this paper, we will illustrate the capability of SAR data exploitation which was impractical over the last decade due to computing limitations. We can envision that SAR imagery encompassing city size coverage at extremely high levels of fidelity could be processed at near-real time using the above technologies to empower the warfighter with access to critical information for the war on terror, homeland defense, as well as urban warfare.
Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment
NASA Technical Reports Server (NTRS)
Duong, Tuan A.; Duong, Vu A.
2012-01-01
A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.
Efficient computer algebra algorithms for polynomial matrices in control design
NASA Technical Reports Server (NTRS)
Baras, J. S.; Macenany, D. C.; Munach, R.
1989-01-01
The theory of polynomial matrices plays a key role in the design and analysis of multi-input multi-output control and communications systems using frequency domain methods. Examples include coprime factorizations of transfer functions, cannonical realizations from matrix fraction descriptions, and the transfer function design of feedback compensators. Typically, such problems abstract in a natural way to the need to solve systems of Diophantine equations or systems of linear equations over polynomials. These and other problems involving polynomial matrices can in turn be reduced to polynomial matrix triangularization procedures, a result which is not surprising given the importance of matrix triangularization techniques in numerical linear algebra. Matrices with entries from a field and Gaussian elimination play a fundamental role in understanding the triangularization process. In the case of polynomial matrices, matrices with entries from a ring for which Gaussian elimination is not defined and triangularization is accomplished by what is quite properly called Euclidean elimination. Unfortunately, the numerical stability and sensitivity issues which accompany floating point approaches to Euclidean elimination are not very well understood. New algorithms are presented which circumvent entirely such numerical issues through the use of exact, symbolic methods in computer algebra. The use of such error-free algorithms guarantees that the results are accurate to within the precision of the model data--the best that can be hoped for. Care must be taken in the design of such algorithms due to the phenomenon of intermediate expressions swell.
A computational algorithm to characterize the geomorphology of hillslopes
NASA Astrophysics Data System (ADS)
Noel, P.; Rousseau, A.; Paniconi, C.
2009-12-01
A computational algorithm has been integrated into a watershed GIS (PHYSITEL software) to characterize the geomorphology of hillslopes from DEMs or relatively homogeneous hydrological units (RHHUs). The procedure begins by dividing RHHUs into either two or three hillslopes, the latter case representative of a headwater unit. Then, the plan shape and profile curvature of each hillslope are computed using either a binary or a fuzzy logic approach to derive the 3D shapes according to Dikau (1989). From these parameters elementary shapes are created to extract a width function for each hillslope. Different criteria, such as total area matching, are used to guide and assess the delineation process. In contrast to other partitioning algorithms (e.g., LandMapR software), the proposed method works within an HU framework and can provide the data needed for existing distributed hydrological models such as HYDROTEL and evolving coupled hillslope/groundwater models based on Boussinesq or kinematic flow paradigms. An application to test this procedure is undergoing for the des Anglais River basin in southwestern Quebec.
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Not Available
1994-09-01
This third volume of the Information Management Architecture for an Integrated Computing Environment for the Environmental Restoration Program--the Interim Technical Architecture (TA) (referred to throughout the remainder of this document as the ER TA)--represents a key milestone in establishing a coordinated information management environment in which information initiatives can be pursued with the confidence that redundancy and inconsistencies will be held to a minimum. This architecture is intended to be used as a reference by anyone whose responsibilities include the acquisition or development of information technology for use by the ER Program. The interim ER TA provides technical guidance at three levels. At the highest level, the technical architecture provides an overall computing philosophy or direction. At this level, the guidance does not address specific technologies or products but addresses more general concepts, such as the use of open systems, modular architectures, graphical user interfaces, and architecture-based development. At the next level, the technical architecture provides specific information technology recommendations regarding a wide variety of specific technologies. These technologies include computing hardware, operating systems, communications software, database management software, application development software, and personal productivity software, among others. These recommendations range from the adoption of specific industry or Martin Marietta Energy Systems, Inc. (Energy Systems) standards to the specification of individual products. At the third level, the architecture provides guidance regarding implementation strategies for the recommended technologies that can be applied to individual projects and to the ER Program as a whole.
Towards automatic Markov reliability modeling of computer architectures
NASA Technical Reports Server (NTRS)
Liceaga, C. A.; Siewiorek, D. P.
1986-01-01
The analysis and evaluation of reliability measures using time-varying Markov models is required for Processor-Memory-Switch (PMS) structures that have competing processes such as standby redundancy and repair, or renewal processes such as transient or intermittent faults. The task of generating these models is tedious and prone to human error due to the large number of states and transitions involved in any reasonable system. Therefore model formulation is a major analysis bottleneck, and model verification is a major validation problem. The general unfamiliarity of computer architects with Markov modeling techniques further increases the necessity of automating the model formulation. This paper presents an overview of the Automated Reliability Modeling (ARM) program, under development at NASA Langley Research Center. ARM will accept as input a description of the PMS interconnection graph, the behavior of the PMS components, the fault-tolerant strategies, and the operational requirements. The output of ARM will be the reliability of availability Markov model formulated for direct use by evaluation programs. The advantages of such an approach are (a) utility to a large class of users, not necessarily expert in reliability analysis, and (b) a lower probability of human error in the computation.
Usage of Thin-Client/Server Architecture in Computer Aided Education
ERIC Educational Resources Information Center
Cimen, Caghan; Kavurucu, Yusuf; Aydin, Halit
2014-01-01
With the advances of technology, thin-client/server architecture has become popular in multi-user/single network environments. Thin-client is a user terminal in which the user can login to a domain and run programs by connecting to a remote server. Recent developments in network and hardware technologies (cloud computing, virtualization, etc.)…
p88110: A Graphical Simulator for Computer Architecture and Organization Courses
ERIC Educational Resources Information Center
Garcia, M. I.; Rodriguez, S.; Perez, A.; Garcia, A.
2009-01-01
Studying fundamental Computer Architecture and Organization topics requires a significant amount of practical work if students are to acquire a good grasp of the theoretical concepts presented in classroom lectures or textbooks. The use of simulators is commonly adopted in order to reach this objective. However, as most of the available…
ERIC Educational Resources Information Center
Hung, Y.-C.
2012-01-01
This paper investigates the impact of combining self explaining (SE) with computer architecture diagrams to help novice students learn assembly language programming. Pre- and post-test scores for the experimental and control groups were compared and subjected to covariance (ANCOVA) statistical analysis. Results indicate that the SE-plus-diagram…
Cloud identification using genetic algorithms and massively parallel computation
NASA Technical Reports Server (NTRS)
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
Programming parallel vision algorithms
Shapiro, L.G.
1988-01-01
Computer vision requires the processing of large volumes of data and requires parallel architectures and algorithms to be useful in real-time, industrial applications. The INSIGHT dataflow language was designed to allow encoding of vision algorithms at all levels of the computer vision paradigm. INSIGHT programs, which are relational in nature, can be translated into a graph structure that represents an architecture for solving a particular vision problem or a configuration of a reconfigurable computational network. The authors consider here INSIGHT programs that produce a parallel net architecture for solving low-, mid-, and high-level vision tasks.
Emotions are emergent processes: they require a dynamic computational architecture
Scherer, Klaus R.
2009-01-01
Emotion is a cultural and psychobiological adaptation mechanism which allows each individual to react flexibly and dynamically to environmental contingencies. From this claim flows a description of the elements theoretically needed to construct a virtual agent with the ability to display human-like emotions and to respond appropriately to human emotional expression. This article offers a brief survey of the desirable features of emotion theories that make them ideal blueprints for agent models. In particular, the component process model of emotion is described, a theory which postulates emotion-antecedent appraisal on different levels of processing that drive response system patterning predictions. In conclusion, investing seriously in emergent computational modelling of emotion using a nonlinear dynamic systems approach is suggested. PMID:19884141
An Object-Oriented Network-Centric Software Architecture for Physical Computing
NASA Astrophysics Data System (ADS)
Palmer, Richard
1997-08-01
Recent developments in object-oriented computer languages and infrastructure such as the Internet, Web browsers, and the like provide an opportunity to define a more productive computational environment for scientific programming that is based more closely on the underlying mathematics describing physics than traditional programming languages such as FORTRAN or C++. In this talk I describe an object-oriented software architecture for representing physical problems that includes classes for such common mathematical objects as geometry, boundary conditions, partial differential and integral equations, discretization and numerical solution methods, etc. In practice, a scientific program written using this architecture looks remarkably like the mathematics used to understand the problem, is typically an order of magnitude smaller than traditional FORTRAN or C++ codes, and hence easier to understand, debug, describe, etc. All objects in this architecture are ``network-enabled,'' which means that components of a software solution to a physical problem can be transparently loaded from anywhere on the Internet or other global network. The architecture is expressed as an ``API,'' or application programmers interface specification, with reference embeddings in Java, Python, and C++. A C++ class library for an early version of this API has been implemented for machines ranging from PC's to the IBM SP2, meaning that phidentical codes run on all architectures.
Advanced entry guidance algorithm with landing footprint computation
NASA Astrophysics Data System (ADS)
Leavitt, James Aaron
The design and performance evaluation of an entry guidance algorithm for future space transportation vehicles is presented. The algorithm performs two functions: on-board trajectory planning and trajectory tracking. The planned longitudinal path is followed by tracking drag acceleration, as is done by the Space Shuttle entry guidance. Unlike the Shuttle entry guidance, lateral path curvature is also planned and followed. A new trajectory planning function for the guidance algorithm is developed that is suitable for suborbital entry and that significantly enhances the overall performance of the algorithm for both orbital and suborbital entry. In comparison with the previous trajectory planner, the new planner produces trajectories that are easier to track, especially near the upper and lower drag boundaries and for suborbital entry. The new planner accomplishes this by matching the vehicle's initial flight path angle and bank angle, and by enforcing the full three-degree-of-freedom equations of motion with control derivative limits. Insights gained from trajectory optimization results contribute to the design of the new planner, giving it near-optimal downrange and crossrange capabilities. Planned trajectories and guidance simulation results are presented that demonstrate the improved performance. Based on the new planner, a method is developed for approximating the landing footprint for entry vehicles in near real-time, as would be needed for an on-board flight management system. The boundary of the footprint is constructed from the endpoints of extreme downrange and crossrange trajectories generated by the new trajectory planner. The footprint algorithm inherently possesses many of the qualities of the new planner, including quick execution, the ability to accurately approximate the vehicle's glide capabilities, and applicability to a wide range of entry conditions. Footprints can be generated for orbital and suborbital entry conditions using a pre
pacce: Perl algorithm to compute continuum and equivalent widths
NASA Astrophysics Data System (ADS)
Riffel, Rogério; Borges Vale, Tibério
2011-08-01
We present Perl Algorithm to Compute continuum and Equivalent Widths ( pacce). We describe the methods used in the computations and the requirements for its usage. We compare the measurements made with pacce and "manual" ones made using iraf splot task. These tests show that for synthetic simple stellar population (SSP) models the equivalent widths strengths are very similar (differences ≲0.2 Å) for both measurements. In real stellar spectra, the correlation between both values is still very good, but with differences of up to 0.5 Å. pacce is also able to determine mean continuum and continuum at line center values, which are helpful in stellar population studies. In addition, it is also able to compute the uncertainties in the equivalent widths using photon statistics. The code is made available for the community through the web at
Recent Algorithmic and Computational Efficiency Improvements in the NIMROD Code
NASA Astrophysics Data System (ADS)
Plimpton, S. J.; Sovinec, C. R.; Gianakon, T. A.; Parker, S. E.
1999-11-01
Extreme anisotropy and temporal stiffness impose severe challenges to simulating low frequency, nonlinear behavior in magnetized fusion plasmas. To address these challenges in computations of realistic experiment configurations, NIMROD(Glasser, et al., Plasma Phys. Control. Fusion 41) (1999) A747. uses a time-split, semi-implicit advance of the two-fluid equations for magnetized plasmas with a finite element/Fourier series spatial representation. The stiffness and anisotropy lead to ill-conditioned linear systems of equations, and they emphasize any truncation errors that may couple different modes of the continuous system. Recent work significantly improves NIMROD's performance in these areas. Implementing a parallel global preconditioning scheme in structured-grid regions permits scaling to large problems and large time steps, which are critical for achieving realistic S-values. In addition, coupling to the AZTEC parallel linear solver package now permits efficient computation with regions of unstructured grid. Changes in the time-splitting scheme improve numerical behavior in simulations with strong flow, and quadratic basis elements are being explored for accuracy. Different numerical forms of anisotropic thermal conduction, critical for slow island evolution, are compared. Algorithms for including gyrokinetic ions in the finite element computations are discussed.
ERIC Educational Resources Information Center
Society for Industrial and Applied Mathematics, Philadelphia, PA.
The critical role of computers in scientific advancement is described in this panel report. With the growing range and complexity of problems that must be solved and with demands of new generations of computers and computer architecture, the importance of computational mathematics is increasing. Multidisciplinary teams are needed; these are found…
Architecture for computer vision application development within the HORUS system
NASA Astrophysics Data System (ADS)
Eckstein, Wolfgang; Steger, Carsten T.
1997-04-01
An integrated program development environment for computer vision tasks is presented. The first component of the system is concerned with the visualization of 2D image data. This is done in an object-oriented manner. Programming of the visualization process is achieved by arranging the representations of iconic data in an interactively customizable hierarchy that establishes an intuitive flow of messages between data representations seen as objects. The visualization objects called displays, are designed for different levels of abstraction, starting from direct iconic representation down to numerical features, depending on the information needed. Two types of messages are passed between these displays, which yield a clear and intuitive semantics. The second component of the system is an interactive tool for rapid program development. It helps the user in selecting appropriate operators in many ways. For example, the system provides context sensitive selection of possible alternative operators as well as suitable successors and required predecessors. For the task of choosing appropriate parameters several alternatives exist. For example, the system provides default values as well as lists of useful values for al parameters of each operator. To achieve this, a knowledge base containing facts about the operators and their parameters is used. Second, through the tight coupling of the two system components, parameters can be determined quickly by data exploration within the visualization components.
Synchronized computational architecture for generalized bilateral control of robot arms
NASA Technical Reports Server (NTRS)
Szakaly, Zoltan F. (Inventor)
1991-01-01
A master six degree of freedom Force Reflecting Hand Controller (FRHC) is available at a master site where a received image displays, in essentially real time, a remote robotic manipulator which is being controlled in the corresponding six degree freedom by command signals which are transmitted to the remote site in accordance with the movement of the FRHC at the master site. Software is user-initiated at the master site in order to establish the basic system conditions, and then a physical movement of the FRHC in Cartesean space is reflected at the master site by six absolute numbers that are sensed, translated and computed as a difference signal relative to the earlier position. The change in position is then transmitted in that differential signal form over a high speed synchronized bilateral communication channel which simultaneously returns robot-sensed response information to the master site as forces applied to the FRHC so that the FRHC reflects the feel of what is taking place at the remote site. A system wide clock rate is selected at a sufficiently high rate that the operator at the master site experiences the Force Reflecting operation in real time.
Adaptation of the anelastic solver EULAG to high performance computing architectures.
NASA Astrophysics Data System (ADS)
Wójcik, Damian; Ciżnicki, Miłosz; Kopta, Piotr; Kulczewski, Michał; Kurowski, Krzysztof; Piotrowski, Zbigniew; Rojek, Krzysztof; Rosa, Bogdan; Szustak, Łukasz; Wyrzykowski, Roman
2014-05-01
In recent years there has been widespread interest in employing heterogeneous and hybrid supercomputing architectures for geophysical research. Especially promising application for the modern supercomputing architectures is the numerical weather prediction (NWP). Adopting traditional NWP codes to the new machines based on multi- and many-core processors, such as GPUs allows to increase computational efficiency and decrease energy consumption. This offers unique opportunity to develop simulations with finer grid resolutions and computational domains larger than ever before. Further, it enables to extend the range of scales represented in the model so that the accuracy of representation of the simulated atmospheric processes can be improved. Consequently, it allows to improve quality of weather forecasts. Coalition of Polish scientific institutions launched a project aimed at adopting EULAG fluid solver for future high-performance computing platforms. EULAG is currently being implemented as a new dynamical core of COSMO Consortium weather prediction framework. The solver code combines features of a stencil and point wise computations. Its communication scheme consists of both halo exchange subroutines and global reduction functions. Within the project, two main modules of EULAG, namely MPDATA advection and iterative GCR elliptic solver are analyzed and optimized. Relevant techniques have been chosen and applied to accelerate code execution on modern HPC architectures: stencil decomposition, block decomposition (with weighting analysis between computation and communication), reduction of inter-cache communication by partitioning of cores into independent teams, cache reusing and vectorization. Experiments with matching computational domain topology to cluster topology are performed as well. The parallel formulation was extended from pure MPI to hybrid MPI - OpenMP approach. Porting to GPU using CUDA directives is in progress. Preliminary results of performance of the
NASA Technical Reports Server (NTRS)
Forman, P.; Moses, K.
1979-01-01
A brief description of a SIFT (Software Implemented Fault Tolerance) Flight Control Computer with emphasis on implementation is presented. A multiprocessor system that relies on software-implemented fault detection and reconfiguration algorithms is described. A high level reliability and fault tolerance is achieved by the replication of computing tasks among processing units.
A Cerebellar Neuroprosthetic System: Computational Architecture and in vivo Test.
Herreros, Ivan; Giovannucci, Andrea; Taub, Aryeh H; Hogri, Roni; Magal, Ari; Bamford, Sim; Prueckl, Robert; Verschure, Paul F M J
2014-01-01
Emulating the input-output functions performed by a brain structure opens the possibility for developing neuroprosthetic systems that replace damaged neuronal circuits. Here, we demonstrate the feasibility of this approach by replacing the cerebellar circuit responsible for the acquisition and extinction of motor memories. Specifically, we show that a rat can undergo acquisition, retention, and extinction of the eye-blink reflex even though the biological circuit responsible for this task has been chemically inactivated via anesthesia. This is achieved by first developing a computational model of the cerebellar microcircuit involved in the acquisition of conditioned reflexes and training it with synthetic data generated based on physiological recordings. Secondly, the cerebellar model is interfaced with the brain of an anesthetized rat, connecting the model's inputs and outputs to afferent and efferent cerebellar structures. As a result, we show that the anesthetized rat, equipped with our neuroprosthetic system, can be classically conditioned to the acquisition of an eye-blink response. However, non-stationarities in the recorded biological signals limit the performance of the cerebellar model. Thus, we introduce an updated cerebellar model and validate it with physiological recordings showing that learning becomes stable and reliable. The resulting system represents an important step toward replacing lost functions of the central nervous system via neuroprosthetics, obtained by integrating a synthetic circuit with the afferent and efferent pathways of a damaged brain region. These results also embody an early example of science-based medicine, where on the one hand the neuroprosthetic system directly validates a theory of cerebellar learning that informed the design of the system, and on the other one it takes a step toward the development of neuro-prostheses that could recover lost learning functions in animals and, in the longer term, humans. PMID:25152887
A Cerebellar Neuroprosthetic System: Computational Architecture and in vivo Test
Herreros, Ivan; Giovannucci, Andrea; Taub, Aryeh H.; Hogri, Roni; Magal, Ari; Bamford, Sim; Prueckl, Robert; Verschure, Paul F. M. J.
2014-01-01
Emulating the input–output functions performed by a brain structure opens the possibility for developing neuroprosthetic systems that replace damaged neuronal circuits. Here, we demonstrate the feasibility of this approach by replacing the cerebellar circuit responsible for the acquisition and extinction of motor memories. Specifically, we show that a rat can undergo acquisition, retention, and extinction of the eye-blink reflex even though the biological circuit responsible for this task has been chemically inactivated via anesthesia. This is achieved by first developing a computational model of the cerebellar microcircuit involved in the acquisition of conditioned reflexes and training it with synthetic data generated based on physiological recordings. Secondly, the cerebellar model is interfaced with the brain of an anesthetized rat, connecting the model’s inputs and outputs to afferent and efferent cerebellar structures. As a result, we show that the anesthetized rat, equipped with our neuroprosthetic system, can be classically conditioned to the acquisition of an eye-blink response. However, non-stationarities in the recorded biological signals limit the performance of the cerebellar model. Thus, we introduce an updated cerebellar model and validate it with physiological recordings showing that learning becomes stable and reliable. The resulting system represents an important step toward replacing lost functions of the central nervous system via neuroprosthetics, obtained by integrating a synthetic circuit with the afferent and efferent pathways of a damaged brain region. These results also embody an early example of science-based medicine, where on the one hand the neuroprosthetic system directly validates a theory of cerebellar learning that informed the design of the system, and on the other one it takes a step toward the development of neuro-prostheses that could recover lost learning functions in animals and, in the longer term, humans. PMID:25152887
Computer implementation of the Kerov-Kirillov-Reshetikhin algorithm
NASA Astrophysics Data System (ADS)
Jakubczyk, P.; Wal, A.; Jakubczyk, D.; Lulek, T.
2012-06-01
The Kerov-Kirillov-Reshetikhin (KKR) algorithm establishes a bijection between semistandard Weyl tableaux of the shape λ and weight μ and rigged string configurations of type (λ,μ). This algorithm can be applied to Heisenberg magnetic chains and their generalisations by use of Schur-Weyl duality, and gives a way to classify all eigenstates of the chain in a methodological manner. Program summaryProgram title: KKR algorithm Catalogue identifier: AELQ_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AELQ_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 974 No. of bytes in distributed program, including test data, etc.: 11 118 Distribution format: tar.gz Programming language: Maple Computer: PC Operating system: Windows, Linux RAM: 1 GB Classification: 2.7, 4.2 Nature of problem: The method of classification of all eigenstates and eigenvalues is given for the generalised model of the one-dimensional Heisenberg magnet consisting of N nodes. Solution method: We use a prolongation of the RSK bijection [1] to the set of all eigenstates of an integrable model, in terms of rigged string configurations. More specifically, one selects from the irreducible basis b of the Schur-Weyl duality all highest weight states |λty> with respect to the unitary group U(n) and generates a rigged string configurations |ν→,{J}>=KKR(|λty>) for each y∈SYT(λ). Running time: The running time is about 1 ms.
NASA Astrophysics Data System (ADS)
Ishikawa, Akio; Kishi, Yoji
2000-09-01
This paper newly proposes a self-healing architecture in all- optical WDM networks based on virtual embedded multiple rings (Virtual Multiple Self Healing Rings: VM-SHR). Focusing upon the network design aspect of the proposed architecture, this paper describes design methodologies for VM-SHR networks. For two major problems in all-optical WDM network design, that is, the connection routing and wavelength assignment problems, we first established solution models based on mathematical programming formulation, each of which can be solved by common integer programming algorithms, respectively. In addition, we also developed an efficient heuristic algorithm for the wavelength assignment problem. Their usefulness and performance are demonstrated through the extensive simulation results.
Paralel Multiphysics Algorithms and Software for Computational Nuclear Engineering
D. Gaston; G. Hansen; S. Kadioglu; D. A. Knoll; C. Newman; H. Park; C. Permann; W. Taitano
2009-08-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple 'code coupling' or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.
Parallel multiphysics algorithms and software for computational nuclear engineering
NASA Astrophysics Data System (ADS)
Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.
2009-07-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.
Computation of Transient Nonlinear Ship Waves Using AN Adaptive Algorithm
NASA Astrophysics Data System (ADS)
Çelebi, M. S.
2000-04-01
An indirect boundary integral method is used to solve transient nonlinear ship wave problems. A resulting mixed boundary value problem is solved at each time-step using a mixed Eulerian- Lagrangian time integration technique. Two dynamic node allocation techniques, which basically distribute nodes on an ever changing body surface, are presented. Both two-sided hyperbolic tangent and variational grid generation algorithms are developed and compared on station curves. A ship hull form is generated in parametric space using a B-spline surface representation. Two-sided hyperbolic tangent and variational adaptive curve grid-generation methods are then applied on the hull station curves to generate effective node placement. The numerical algorithm, in the first method, used two stretching parameters. In the second method, a conservative form of the parametric variational Euler-Lagrange equations is used the perform an adaptive gridding on each station. The resulting unsymmetrical influence coefficient matrix is solved using both a restarted version of GMRES based on the modified Gram-Schmidt procedure and a line Jacobi method based on LU decomposition. The convergence rates of both matrix iteration techniques are improved with specially devised preconditioners. Numerical examples of node placements on typical hull cross-sections using both techniques are discussed and fully nonlinear ship wave patterns and wave resistance computations are presented.
CRADA ORNL 91-0046B final report: Assessment of IBM advanced computing architectures
Geist, G.A.
1996-02-01
This was a Cooperative Research and Development Agreement (CRADA) with IBM to assess their advanced computer architectures. Over the course of this project three different architectures were evaluated. The POWER/4 RIOS1 based shared memory multiprocessor, the POWER/2 RIOS2 based high performance workstation, and the J30 PowerPC based shared memory multiprocessor. In addition to this hardware several software packages where beta tested for IBM including: ESSO scientific computing library, nv video-conferencing package, Ultimedia multimedia display environment, FORTRAN 90 and C++ compilers, and the AIX 4.1 operating system. Both IBM and ORNL benefited from the research performed in this project and even though access to the POWER/4 computer was delayed several months, all milestones were met.
Scalable quantum computer architecture with coupled donor-quantum dot qubits
Schenkel, Thomas; Lo, Cheuk Chi; Weis, Christoph; Lyon, Stephen; Tyryshkin, Alexei; Bokor, Jeffrey
2014-08-26
A quantum bit computing architecture includes a plurality of single spin memory donor atoms embedded in a semiconductor layer, a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, wherein a first voltage applied across at least one pair of the aligned quantum dot and donor atom controls a donor-quantum dot coupling. A method of performing quantum computing in a scalable architecture quantum computing apparatus includes arranging a pattern of single spin memory donor atoms in a semiconductor layer, forming a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, applying a first voltage across at least one aligned pair of a quantum dot and donor atom to control a donor-quantum dot coupling, and applying a second voltage between one or more quantum dots to control a Heisenberg exchange J coupling between quantum dots and to cause transport of a single spin polarized electron between quantum dots.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Design of a fault tolerant airborne digital computer. Volume 1: Architecture
NASA Technical Reports Server (NTRS)
Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.
1973-01-01
This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Integrating Computer Architectures into the Design of High-Performance Controllers
NASA Technical Reports Server (NTRS)
Jacklin, Stephen A.; Leyland, Jane A.; Warmbrodt, William
1986-01-01
Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, on-line graphics, and file management. This paper discusses five global design considerations that are useful to integrate array processor, multimicroprocessor, and host computer system architecture into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the non-real-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration will be briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind-tunnel environment, the control architecture can generally be applied to a wide range of automatic control applications.
NASA Astrophysics Data System (ADS)
Liu, Lei; Hong, Xiaobin; Wu, Jian; Lin, Jintong
As Grid computing continues to gain popularity in the industry and research community, it also attracts more attention from the customer level. The large number of users and high frequency of job requests in the consumer market make it challenging. Clearly, all the current Client/Server(C/S)-based architecture will become unfeasible for supporting large-scale Grid applications due to its poor scalability and poor fault-tolerance. In this paper, based on our previous works [1, 2], a novel self-organized architecture to realize a highly scalable and flexible platform for Grids is proposed. Experimental results show that this architecture is suitable and efficient for consumer-oriented Grids.
Equation-free multiscale computation: algorithms and applications.
Kevrekidis, Ioannis G; Samaey, Giovanni
2009-01-01
In traditional physicochemical modeling, one derives evolution equations at the (macroscopic, coarse) scale of interest; these are used to perform a variety of tasks (simulation, bifurcation analysis, optimization) using an arsenal of analytical and numerical techniques. For many complex systems, however, although one observes evolution at a macroscopic scale of interest, accurate models are only given at a more detailed (fine-scale, microscopic) level of description (e.g., lattice Boltzmann, kinetic Monte Carlo, molecular dynamics). Here, we review a framework for computer-aided multiscale analysis, which enables macroscopic computational tasks (over extended spatiotemporal scales) using only appropriately initialized microscopic simulation on short time and length scales. The methodology bypasses the derivation of macroscopic evolution equations when these equations conceptually exist but are not available in closed form-hence the term equation-free. We selectively discuss basic algorithms and underlying principles and illustrate the approach through representative applications. We also discuss potential difficulties and outline areas for future research. PMID:19335220
Delta: An object-oriented finite element code architecture for massively parallel computers
Weatherby, J.R.; Schutt, J.A.; Peery, J.S.; Hogan, R.E.
1996-02-01
Delta is an object-oriented code architecture based on the finite element method which enables simulation of a wide range of engineering mechanics problems in a parallel processing environment. Written in C{sup ++}, Delta is a natural framework for algorithm development and for research involving coupling of mechanics from different Engineering Science disciplines. To enhance flexibility and encourage code reuse, the architecture provides a clean separation of the major aspects of finite element programming. Spatial discretization, temporal discretization, and the solution of linear and nonlinear systems of equations are each implemented separately, independent from the governing field equations. Other attractive features of the Delta architecture include support for constitutive models with internal variables, reusable ``matrix-free`` equation solvers, and support for region-to-region variations in the governing equations and the active degrees of freedom. A demonstration code built from the Delta architecture has been used in two-dimensional and three-dimensional simulations involving dynamic and quasi-static solid mechanics, transient and steady heat transport, and flow in porous media.
Trapezoidal rule quadrature algorithms for MIMD distributed memory computers
Lyness, J.N.; Plowman, S.E.
1994-08-01
An approach to multi-dimensional quadrature, designed to exploit parallel architectures, is described. This involves transforming the integral in such a way that an accurate result is given by the trapezoidal rule; and by evaluating the resulting sum in a manner which may be efficiently implemented on parallel architectures. This approach is to be implemented in the Liverpool NAG transputer library.
CPU SIM: A Computer Simulator for Use in an Introductory Computer Organization-Architecture Class.
ERIC Educational Resources Information Center
Skrein, Dale
1994-01-01
CPU SIM, an interactive low-level computer simulation package that runs on the Macintosh computer, is described. The program is designed for instructional use in the first or second year of undergraduate computer science, to teach various features of typical computer organization through hands-on exercises. (MSE)
Efficient conjugate gradient algorithms for computation of the manipulator forward dynamics
NASA Technical Reports Server (NTRS)
Fijany, Amir; Scheid, Robert E.
1989-01-01
The applicability of conjugate gradient algorithms for computation of the manipulator forward dynamics is investigated. The redundancies in the previously proposed conjugate gradient algorithm are analyzed. A new version is developed which, by avoiding these redundancies, achieves a significantly greater efficiency. A preconditioned conjugate gradient algorithm is also presented. A diagonal matrix whose elements are the diagonal elements of the inertia matrix is proposed as the preconditioner. In order to increase the computational efficiency, an algorithm is developed which exploits the synergism between the computation of the diagonal elements of the inertia matrix and that required by the conjugate gradient algorithm.
A class of parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing.
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2015-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit the device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital(reconfigurable) and subthreshold mixed-signal realizations. Both Electroencephalogram (EEG) and Electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90 and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively. PMID:26869876
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2016-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit the device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital(reconfigurable) and subthreshold mixed-signal realizations. Both Electroencephalogram (EEG) and Electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90 and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively. PMID:26869876
Multiplier Architecture for Coding Circuits
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.
1986-01-01
Multipliers based on new algorithm for Galois-field (GF) arithmetic regular and expandable. Pipeline structures used for computing both multiplications and inverses. Designs suitable for implementation in very-large-scale integrated (VLSI) circuits. This general type of inverter and multiplier architecture especially useful in performing finite-field arithmetic of Reed-Solomon error-correcting codes and of some cryptographic algorithms.
NASA Technical Reports Server (NTRS)
Kao, M. H.; Bodenheimer, R. E.
1976-01-01
The tse computer's capability of achieving image congruence between temporal and multiple images with misregistration due to rotational differences is reported. The coordinate transformations are obtained and a general algorithms is devised to perform image rotation using tse operations very efficiently. The details of this algorithm as well as its theoretical implications are presented. Step by step procedures of image registration are described in detail. Numerous examples are also employed to demonstrate the correctness and the effectiveness of the algorithms and conclusions and recommendations are made.
Development Of A Three-Dimensional Circuit Integration Technology And Computer Architecture
NASA Astrophysics Data System (ADS)
Etchells, R. D.; Grinberg, J.; Nudd, G. R.
1981-12-01
This paper is the first of a series 1,2,3 describing a range of efforts at Hughes Research Laboratories, which are collectively referred to as "Three-Dimensional Microelectronics." The technology being developed is a combination of a unique circuit fabrication/packaging technology and a novel processing architecture. The packaging technology greatly reduces the parasitic impedances associated with signal-routing in complex VLSI structures, while simultaneously allowing circuit densities orders of magnitude higher than the current state-of-the-art. When combined with the 3-D processor architecture, the resulting machine exhibits a one- to two-order of magnitude simultaneous improvement over current state-of-the-art machines in the three areas of processing speed, power consumption, and physical volume. The 3-D architecture is essentially that commonly referred to as a "cellular array", with the ultimate implementation having as many as 512 x 512 processors working in parallel. The three-dimensional nature of the assembled machine arises from the fact that the chips containing the active circuitry of the processor are stacked on top of each other. In this structure, electrical signals are passed vertically through the chips via thermomigrated aluminum feedthroughs. Signals are passed between adjacent chips by micro-interconnects. This discussion presents a broad view of the total effort, as well as a more detailed treatment of the fabrication and packaging technologies themselves. The results of performance simulations of the completed 3-D processor executing a variety of algorithms are also presented. Of particular pertinence to the interests of the focal-plane array community is the simulation of the UNICORNS nonuniformity correction algorithms as executed by the 3-D architecture.
ICASE Computer Science Program
NASA Technical Reports Server (NTRS)
1985-01-01
The Institute for Computer Applications in Science and Engineering computer science program is discussed in outline form. Information is given on such topics as problem decomposition, algorithm development, programming languages, and parallel architectures.
Service-Oriented Architecture for NVO and TeraGrid Computing
NASA Technical Reports Server (NTRS)
Jacob, Joseph; Miller, Craig; Williams, Roy; Steenberg, Conrad; Graham, Matthew
2008-01-01
The National Virtual Observatory (NVO) Extensible Secure Scalable Service Infrastructure (NESSSI) is a Web service architecture and software framework that enables Web-based astronomical data publishing and processing on grid computers such as the National Science Foundation's TeraGrid. Characteristics of this architecture include the following: (1) Services are created, managed, and upgraded by their developers, who are trusted users of computing platforms on which the services are deployed. (2) Service jobs can be initiated by means of Java or Python client programs run on a command line or with Web portals. (3) Access is granted within a graduated security scheme in which the size of a job that can be initiated depends on the level of authentication of the user.
ASIC-based architecture for the real-time computation of 2D convolution with large kernel size
NASA Astrophysics Data System (ADS)
Shao, Rui; Zhong, Sheng; Yan, Luxin
2015-12-01
Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the ASIC-based implementation of 2-D convolution with medium-large kernels. Aiming to improve the efficiency of storage resources on-chip, reducing off-chip bandwidth of these two issues, proposed construction of a data cache reuse. Multi-block SPRAM to cross cached images and the on-chip ping-pong operation takes full advantage of the data convolution calculation reuse, design a new ASIC data scheduling scheme and overall architecture. Experimental results show that the structure can achieve 40× 32 size of template real-time convolution operations, and improve the utilization of on-chip memory bandwidth and on-chip memory resources, the experimental results show that the structure satisfies the conditions to maximize data throughput output , reducing the need for off-chip memory bandwidth.
Optimization of knowledge sharing through multi-forum using cloud computing architecture
NASA Astrophysics Data System (ADS)
Madapusi Vasudevan, Sriram; Sankaran, Srivatsan; Muthuswamy, Shanmugasundaram; Ram, N. Sankar
2011-12-01
Knowledge sharing is done through various knowledge sharing forums which requires multiple logins through multiple browser instances. Here a single Multi-Forum knowledge sharing concept is introduced which requires only one login session which makes user to connect multiple forums and display the data in a single browser window. Also few optimization techniques are introduced here to speed up the access time using cloud computing architecture.
Cladé, Thierry; Snyder, Joshua C.
2010-01-01
Clinical trials which use imaging typically require data management and workflow integration across several parties. We identify opportunities for all parties involved to realize benefits with a modular interoperability model based on service-oriented architecture and grid computing principles. We discuss middleware products for implementation of this model, and propose caGrid as an ideal candidate due to its healthcare focus; free, open source license; and mature developer tools and support. PMID:20449775
New algorithms for the symmetric tridiagonal eigenvalue computation
Pan, V. |
1994-12-31
The author presents new algorithms that accelerate the bisection method for the symmetric eigenvalue problem. The algorithms rely on some new techniques, which include acceleration of Newton`s iteration and can also be further applied to acceleration of some other iterative processes, in particular, of iterative algorithms for approximating polynomial zeros.
NASA Astrophysics Data System (ADS)
Niknam, Mehdi; Thulasiraman, Parimala; Camorlinga, Sergio
2010-11-01
Connected component labelling is an essential step in image processing. We provide a parallel version of Suzuki's sequential connected component algorithm in order to speed up the labelling process. Also, we modify the algorithm to enable labelling gray-scale images. Due to the data dependencies in the algorithm we used a method similar to pipeline to exploit parallelism. The parallel algorithm method achieved a speedup of 2.5 for image size of 256 × 256 pixels using 4 processing threads.
CCARES: A computer algorithm for the reliability analysis of laminated CMC components
NASA Technical Reports Server (NTRS)
Duffy, Stephen F.; Gyekenyesi, John P.
1993-01-01
Structural components produced from laminated CMC (ceramic matrix composite) materials are being considered for a broad range of aerospace applications that include various structural components for the national aerospace plane, the space shuttle main engine, and advanced gas turbines. Specifically, these applications include segmented engine liners, small missile engine turbine rotors, and exhaust nozzles. Use of these materials allows for improvements in fuel efficiency due to increased engine temperatures and pressures, which in turn generate more power and thrust. Furthermore, this class of materials offers significant potential for raising the thrust-to-weight ratio of gas turbine engines by tailoring directions of high specific reliability. The emerging composite systems, particularly those with silicon nitride or silicon carbide matrix, can compete with metals in many demanding applications. Laminated CMC prototypes have already demonstrated functional capabilities at temperatures approaching 1400 C, which is well beyond the operational limits of most metallic materials. Laminated CMC material systems have several mechanical characteristics which must be carefully considered in the design process. Test bed software programs are needed that incorporate stochastic design concepts that are user friendly, computationally efficient, and have flexible architectures that readily incorporate changes in design philosophy. The CCARES (Composite Ceramics Analysis and Reliability Evaluation of Structures) program is representative of an effort to fill this need. CCARES is a public domain computer algorithm, coupled to a general purpose finite element program, which predicts the fast fracture reliability of a structural component under multiaxial loading conditions.
NASA Astrophysics Data System (ADS)
Ezer, Gulbin A.
2005-03-01
The challenges of new embedded applications have conflicting requirements: complex algorithms, evolving standards, shorter product cycles dictate programmable solutions, and yet, high data bandwidth, compute power and lower power consumption dictate carefully crafted hardwired functional modules. An application specific instruction set processor (ASIP) is ideally suited to provide most of the advantages of hardwired logic, while maintaining the time-to-market and programmability advantages of a general purpose processor. This paper presents the unique blend of high compute performance and i/o bandwidth of the configurable and extensible Xtensa LX ASIP architectures. Xtensa LX provides high compute performance with wide instruction words using multiple operation slots that enable superscalar performance suitable for data-intensive applications. Xtensa LX also provides high I/O bandwidth through its multiple load/store units that provide parallel low latency access or external DMA access to local memories and virtually unlimited number of ports and queues directly connected to the processor core functional units and system control registers, which remove the I/O bottleneck of traditional processors. The advantages of Xtensa LX features are proven with their impressive performance results: 171.6 ConsumerMark on out-of-the-box simulation of EEMBC consumer suite and a BDTIsimMark2000™ score of 6150 at 370MHz.
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method- even without acceleration/preconditioning-is completitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution times's trend rises generally more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4.096.
Alternative algorithms for computational fluid dynamics. Final report
Ladd, A.J.C.
1995-03-03
Fluid flow is conventionally modeled by finite difference or finite element approximations to the Navier-Stokes equations. The key problem in such calculations is devising an efficient computational mesh on which to solve the equations; if the geometry is complex, extensive human intervention is usually necessary. Thus these methods are unsuitable for problems such as the motion of solid particulates in suspension, where there may be many thousands of objects whose positions are constantly varying over the course of the simulation. Over the past few years I have developed an alternative strategy for modeling solid-fluid flows, based on a discrete Boltzmann model, in which the particle velocities are sampled from a small well-chosen set, commensurate with the underlying spatial lattice. This leads to a simple and fast numerical algorithm which can solve fluid flow problems with high accuracy on relatively crude spatial meshes. Thus it has been possible to track the motion of around 1000 hydrodynamically interacting particles on a desktop workstation. A preliminary account of some of this work was published in Physical Review Letters; a complete account of the method is given in two papers published by the Journal of Fluid Mechanics.
Computational Performance Assessment of k-mer Counting Algorithms.
Pérez, Nelson; Gutierrez, Miguel; Vera, Nelson
2016-04-01
This article is about the assessment of several tools for k-mer counting, with the purpose to create a reference framework for bioinformatics researchers to identify computational requirements, parallelizing, advantages, disadvantages, and bottlenecks of each of the algorithms proposed in the tools. The k-mer counters evaluated in this article were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. Measured parameters were the following: RAM occupied space, processing time, parallelization, and read and write disk access. A dataset consisting of 36,504,800 reads was used corresponding to the 14th human chromosome. The assessment was performed for two k-mer lengths: 31 and 55. Obtained results were the following: pure Bloom filter-based tools and disk-partitioning techniques showed a lesser RAM use. The tools that took less execution time were the ones that used disk-partitioning techniques. The techniques that made the major parallelization were the ones that used disk partitioning, hash tables with lock-free approach, or multiple hash tables. PMID:26982880
Development of new flux splitting schemes. [computational fluid dynamics algorithms
NASA Technical Reports Server (NTRS)
Liou, Meng-Sing; Steffen, Christopher J., Jr.
1992-01-01
Maximizing both accuracy and efficiency has been the primary objective in designing a numerical algorithm for computational fluid dynamics (CFD). This is especially important for solutions of complex three dimensional systems of Navier-Stokes equations which often include turbulence modeling and chemistry effects. Recently, upwind schemes have been well received for their capability in resolving discontinuities. With this in mind, presented are two new flux splitting techniques for upwind differencing. The first method is based on High-Order Polynomial Expansions (HOPE) of the mass flux vector. The second new flux splitting is based on the Advection Upwind Splitting Method (AUSM). The calculation of the hypersonic conical flow demonstrates the accuracy of the splitting in resolving the flow in the presence of strong gradients. A second series of tests involving the two dimensional inviscid flow over a NACA 0012 airfoil demonstrates the ability of the AUSM to resolve the shock discontinuity at transonic speed. A third case calculates a series of supersonic flows over a circular cylinder. Finally, the fourth case deals with tests of a two dimensional shock wave/boundary layer interaction.
Algorithms for computer detection of symmetry elements in molecular systems.
Beruski, Otávio; Vidal, Luciano N
2014-02-01
Simple procedures for the location of proper and improper rotations and reflexion planes are presented. The search is performed with a molecule divided into subsets of symmetrically equivalent atoms (SEA) which are analyzed separately as if they were a single molecule. This approach is advantageous in many aspects. For instance, in those molecules that are symmetric rotors, the number of atoms and the inertia tensor of the SEA provide one straight way to find proper rotations of any order. The algorithms are invariant to the molecular orientation and their computational cost is low, because the main information required to find symmetry elements is interatomic distances and the principal moments of the SEA. For example, our Fortran implementation, running on a single processor, took only a few seconds to locate all 120 symmetry operations of the large and highly symmetrical fullerene C720, belonging to the Ih point group. Finally, we show how the interatomic distances matrix of a slightly unsymmetrical molecule is used to symmetrize its geometry. PMID:24403016
A constrained conjugate gradient algorithm for computed tomography
Azevedo, S.G.; Goodman, D.M.
1994-11-15
Image reconstruction from projections of x-ray, gamma-ray, protons and other penetrating radiation is a well-known problem in a variety of fields, and is commonly referred to as computed tomography (CT). Various analytical and series expansion methods of reconstruction and been used in the past to provide three-dimensional (3D) views of some interior quantity. The difficulties of these approaches lie in the cases where (a) the number of views attainable is limited, (b) the Poisson (or other) uncertainties are significant, (c) quantifiable knowledge of the object is available, but not implementable, or (d) other limitations of the data exist. We have adapted a novel nonlinear optimization procedure developed at LLNL to address limited-data image reconstruction problems. The technique, known as nonlinear least squares with general constraints or constrained conjugate gradients (CCG), has been successfully applied to a number of signal and image processing problems, and is now of great interest to the image reconstruction community. Previous applications of this algorithm to deconvolution problems and x-ray diffraction images for crystallography have shown the great promise.
Computational Tools and Algorithms for Designing Customized Synthetic Genes
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
Computational tools and algorithms for designing customized synthetic genes.
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
An Architecture and Supporting Environment of Service-Oriented Computing Based-On Context Awareness
NASA Astrophysics Data System (ADS)
Ma, Tianxiao; Wu, Gang; Huang, Jun
Service-oriented computing (SOC) is emerging to be an important computing paradigm of the next future. Based on context awareness, this paper proposes an architecture of SOC. A definition of the context in open environments such as Internet is given, which is based on ontology. The paper also proposes a supporting environment for the context-aware SOC, which focus on services on-demand composition and context-awareness evolving. A reference implementation of the supporting environment based on OSGi[11] is given at last.
NASA Astrophysics Data System (ADS)
Gerjuoy, Edward
2005-06-01
The security of messages encoded via the widely used RSA public key encryption system rests on the enormous computational effort required to find the prime factors of a large number N using classical (conventional) computers. In 1994 Peter Shor showed that for sufficiently large N, a quantum computer could perform the factoring with much less computational effort. This paper endeavors to explain, in a fashion comprehensible to the nonexpert, the RSA encryption protocol; the various quantum computer manipulations constituting the Shor algorithm; how the Shor algorithm performs the factoring; and the precise sense in which a quantum computer employing Shor's algorithm can be said to accomplish the factoring of very large numbers with less computational effort than a classical computer. It is made apparent that factoring N generally requires many successive runs of the algorithm. Our analysis reveals that the probability of achieving a successful factorization on a single run is about twice as large as commonly quoted in the literature.
Efficient algorithm to compute mutually connected components in interdependent networks.
Hwang, S; Choi, S; Lee, Deokjae; Kahng, B
2015-02-01
Mutually connected components (MCCs) play an important role as a measure of resilience in the study of interdependent networks. Despite their importance, an efficient algorithm to obtain the statistics of all MCCs during the removal of links has thus far been absent. Here, using a well-known fully dynamic graph algorithm, we propose an efficient algorithm to accomplish this task. We show that the time complexity of this algorithm is approximately O(N(1.2)) for random graphs, which is more efficient than O(N(2)) of the brute-force algorithm. We confirm the correctness of our algorithm by comparing the behavior of the order parameter as links are removed with existing results for three types of double-layer multiplex networks. We anticipate that this algorithm will be used for simulations of large-size systems that have been previously inaccessible. PMID:25768559
NASA Astrophysics Data System (ADS)
Cho, Yoong-Goog; Chandrasekar, V.; Jayasumana, Anura P.; Brunkow, David
2002-06-01
he design, architecture, and implementation for the high-throughput data transmission and high-performance computing,which are applicable for various real-time radar signal transmission applications over the data network, are presented. With a client-server model, the multiple processes and threads on the end systems operate simultaneously and collaborately to meet the real-time requirement. The design covers the Digitized Radar Signal (DRS) data acquisition and data transmission on the DRS server end as well as DRS data receiving, radar signal parameter computation and parameter transmission on the DRS receiver end. Generic packet and data structures for transmission and inter-process data sharing are constructed. The architecture was successfully implemented on Sun/Solaris workstations with dual 750 MHz UltraSPARC-III processors containing Gigabit Ethernet card. The comparison in transmission throughput over gigabit link between with computation and without computation clearly shows the importance of the signal processing capability on the end-to-end performance. Profiling analysis on the DRS receiver process shows the work-loaded functions and provides guides for improving computing capabilities.
Quantum computation: algorithms and implementation in quantum dot devices
NASA Astrophysics Data System (ADS)
Gamble, John King
In this thesis, we explore several aspects of both the software and hardware of quantum computation. First, we examine the computational power of multi-particle quantum random walks in terms of distinguishing mathematical graphs. We study both interacting and non-interacting multi-particle walks on strongly regular graphs, proving some limitations on distinguishing powers and presenting extensive numerical evidence indicative of interactions providing more distinguishing power. We then study the recently proposed adiabatic quantum algorithm for Google PageRank, and show that it exhibits power-law scaling for realistic WWW-like graphs. Turning to hardware, we next analyze the thermal physics of two nearby 2D electron gas (2DEG), and show that an analogue of the Coulomb drag effect exists for heat transfer. In some distance and temperature, this heat transfer is more significant than phonon dissipation channels. After that, we study the dephasing of two-electron states in a single silicon quantum dot. Specifically, we consider dephasing due to the electron-phonon coupling and charge noise, separately treating orbital and valley excitations. In an ideal system, dephasing due to charge noise is strongly suppressed due to a vanishing dipole moment. However, introduction of disorder or anharmonicity leads to large effective dipole moments, and hence possibly strong dephasing. Building on this work, we next consider more realistic systems, including structural disorder systems. We present experiment and theory, which demonstrate energy levels that vary with quantum dot translation, implying a structurally disordered system. Finally, we turn to the issues of valley mixing and valley-orbit hybridization, which occurs due to atomic-scale disorder at quantum well interfaces. We develop a new theoretical approach to study these effects, which we name the disorder-expansion technique. We demonstrate that this method successfully reproduces atomistic tight-binding techniques
Computationally efficient algorithm for high sampling-frequency operation of active noise control
NASA Astrophysics Data System (ADS)
Rout, Nirmal Kumar; Das, Debi Prasad; Panda, Ganapati
2015-05-01
In high sampling-frequency operation of active noise control (ANC) system the length of the secondary path estimate and the ANC filter are very long. This increases the computational complexity of the conventional filtered-x least mean square (FXLMS) algorithm. To reduce the computational complexity of long order ANC system using FXLMS algorithm, frequency domain block ANC algorithms have been proposed in past. These full block frequency domain ANC algorithms are associated with some disadvantages such as large block delay, quantization error due to computation of large size transforms and implementation difficulties in existing low-end DSP hardware. To overcome these shortcomings, the partitioned block ANC algorithm is newly proposed where the long length filters in ANC are divided into a number of equal partitions and suitably assembled to perform the FXLMS algorithm in the frequency domain. The complexity of this proposed frequency domain partitioned block FXLMS (FPBFXLMS) algorithm is quite reduced compared to the conventional FXLMS algorithm. It is further reduced by merging one fast Fourier transform (FFT)-inverse fast Fourier transform (IFFT) combination to derive the reduced structure FPBFXLMS (RFPBFXLMS) algorithm. Computational complexity analysis for different orders of filter and partition size are presented. Systematic computer simulations are carried out for both the proposed partitioned block ANC algorithms to show its accuracy compared to the time domain FXLMS algorithm.
NASA Astrophysics Data System (ADS)
Meyer-Baese, Uwe H.; Meyer-Baese, Anke; Ramirez, Javier; Garcia, Antonio
2003-08-01
In this paper, a new parallel hardware architecture dedicated to compute the Gaussian Potential Function is proposed. This function is commonly utilized in neural radial basis classifiers for pattern recognition as described by Lee; Girosi and Poggio; and Musavi et al. Attention to a simplified Gaussian Potential Function which processes uncorrelated features is confined. Operations of most interest included by the Gaussian potential function are the exponential and the square function. Our hardware computes the exponential function and its exponent at the same time. The contributions of all features to the exponent are computed in parallel. This parallelism reduces computational delay in the output function. The duration does not depend on the number of features processed. Software and hardware case studies are presented to evaluate the new CORDIC.
Architecture for event-driven real-time distributed computer systems
McDonald, J.E.
1983-01-01
The author describes a proposed preliminary system design that includes hardware and software for real-time distributed computer systems. This new system is appropriate as a digital avionics architecture or as a real-time multi-computer simulation system using a mixture of computers, mainframes to micros. The hardware contains a network that employs high-speed serial data transmission concepts in emulating a multicomputer shared memory system. The distributed multicomputer system then capitalizes on the attributes of the hardware by structuring the real-time software as the data-driven input-output system. The real-time software executes only on demand and not synchronously as in conventional real-time systems. Background information concerning multi-computer systems using serial and parallel data transmission networks is given. This information supports the design rationale of the proposed hardware system which is basically a technology blend of conventional serial and parallel transmission schemes. 2 references.
ERIC Educational Resources Information Center
Arumi, Francisco N.
Computer programs capable of describing the thermal behavior of buildings are used to help architectural students understand environmental systems. The Numerical Simulation Laboratory at the Architectural School of the University of Texas at Austin was developed to provide the necessary software capable of simulating the energy transactions…
NASA Astrophysics Data System (ADS)
Smail, Linda
2016-06-01
The basic task of any probabilistic inference system in Bayesian networks is computing the posterior probability distribution for a subset or subsets of random variables, given values or evidence for some other variables from the same Bayesian network. Many methods and algorithms have been developed to exact and approximate inference in Bayesian networks. This work compares two exact inference methods in Bayesian networks-Lauritzen-Spiegelhalter and the successive restrictions algorithm-from the perspective of computational efficiency. The two methods were applied for comparison to a Chest Clinic Bayesian Network. Results indicate that the successive restrictions algorithm shows more computational efficiency than the Lauritzen-Spiegelhalter algorithm.
NASA Astrophysics Data System (ADS)
Chen, Yufeng; Wu, Zebin; Sun, Le; Wei, Zhihui; Li, Yonglong
2016-04-01
With the gradual increase in the spatial and spectral resolution of hyperspectral images, the size of image data becomes larger and larger, and the complexity of processing algorithms is growing, which poses a big challenge to efficient massive hyperspectral image processing. Cloud computing technologies distribute computing tasks to a large number of computing resources for handling large data sets without the limitation of memory and computing resource of a single machine. This paper proposes a parallel pixel purity index (PPI) algorithm for unmixing massive hyperspectral images based on a MapReduce programming model for the first time in the literature. According to the characteristics of hyperspectral images, we describe the design principle of the algorithm, illustrate the main cloud unmixing processes of PPI, and analyze the time complexity of serial and parallel algorithms. Experimental results demonstrate that the parallel implementation of the PPI algorithm on the cloud can effectively process big hyperspectral data and accelerate the algorithm.
An adaptive replacement algorithm for paged-memory computer systems.
NASA Technical Reports Server (NTRS)
Thorington, J. M., Jr.; Irwin, J. D.
1972-01-01
A general class of adaptive replacement schemes for use in paged memories is developed. One such algorithm, called SIM, is simulated using a probability model that generates memory traces, and the results of the simulation of this adaptive scheme are compared with those obtained using the best nonlookahead algorithms. A technique for implementing this type of adaptive replacement algorithm with state of the art digital hardware is also presented.
ERIC Educational Resources Information Center
Uwakonye, Obioha; Alagbe, Oluwole; Oluwatayo, Adedapo; Alagbe, Taiye; Alalade, Gbenga
2015-01-01
As a result of globalization of digital technology, intellectual discourse on what constitutes the basic body of architectural knowledge to be imparted to future professionals has been on the increase. This digital revolution has brought to the fore the need to review the already overloaded architectural education curriculum of Nigerian schools of…
Mental Computation or Standard Algorithm? Children's Strategy Choices on Multi-Digit Subtractions
ERIC Educational Resources Information Center
Torbeyns, Joke; Verschaffel, Lieven
2016-01-01
This study analyzed children's use of mental computation strategies and the standard algorithm on multi-digit subtractions. Fifty-eight Flemish 4th graders of varying mathematical achievement level were individually offered subtractions that either stimulated the use of mental computation strategies or the standard algorithm in one choice and two…
A 0.13-µm implementation of 5 Gb/s and 3-mW folded parallel architecture for AES algorithm
NASA Astrophysics Data System (ADS)
Rahimunnisa, K.; Karthigaikumar, P.; Kirubavathy, J.; Jayakumar, J.; Kumar, S. Suresh
2014-02-01
A new architecture for encrypting and decrypting the confidential data using Advanced Encryption Standard algorithm is presented in this article. This structure combines the folded structure with parallel architecture to increase the throughput. The whole architecture achieved high throughput with less power. The proposed architecture is implemented in 0.13-µm Complementary metal-oxide-semiconductor (CMOS) technology. The proposed structure is compared with different existing structures, and from the result it is proved that the proposed structure gives higher throughput and less power compared to existing works.
NASA Astrophysics Data System (ADS)
Jolly, Sundeep; Smalley, Daniel; Barabas, James; Bove, V. Michael
2014-02-01
The MIT Mark IV holographic display system employs a novel anisotropic leaky-mode spatial light modulator that allows for the simultaneous and superimposed modulation of red, green, and blue light via wavelength-division multiplexing. This WDM-based scheme for full-color display requires that incoming video signals containing holographic fringe information are comprised of non-overlapping spectral bands that fall within the available 200 MHz output bandwidth of commercial GPUs. These bands correspond to independent color channels in the display output and are appropriately band-limited and centered to match the multiplexed passbands and center frequencies in the frequency response of the mode-coupling device. The computational architecture presented in this paper involves the computation of holographic fringe patterns for each color channel and their summation in generating a single video signal for input to the display. In composite, 18 such input signals, each containing holographic fringe information for 26 horizontal-parallax only holographic lines, are generated via three dual-head GPUs for a total of 468 holographic lines in the display output. We present a general scheme for full-color CGH computation for input to Mark IV and furthermore depict the adaptation of the diffraction specific coherent panoramagram approach to fringe computation for the Mark IV architecture.
Ultra-Fast Data-Mining Hardware Architecture Based on Stochastic Computing
Oliver, Antoni; Alomar, Miquel L.
2015-01-01
Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area. PMID:25955274
Ultra-fast data-mining hardware architecture based on stochastic computing.
Morro, Antoni; Canals, Vincent; Oliver, Antoni; Alomar, Miquel L; Rossello, Josep L
2015-01-01
Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area. PMID:25955274
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the fifth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 5 describes the low-level data management component of the NICE software. It is intended only for advanced programmers involved in maintenance of the software.
NASA Technical Reports Server (NTRS)
Wright, Mary A.; Regelbrugge, Marc E.; Felippa, Carlos A.
1989-01-01
This is the fourth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 4 describes the nominal-record data management component of the NICE software. It is intended for all users.
SSME structural computer program development: BOPACE theoretical manual, addendum. [algorithms
NASA Technical Reports Server (NTRS)
1975-01-01
An algorithm developed and incorporated into BOPACE for improving the convergence and accuracy of the inelastic stress-strain calculations is discussed. The implementation of separation of strains in the residual-force iterative procedure is defined. The elastic-plastic quantities used in the strain-space algorithm are defined and compared with previous quantities.
Resampling Algorithms for Particle Filters: A Computational Complexity Perspective
NASA Astrophysics Data System (ADS)
Bolić, Miodrag; Djurić, Petar M.; Hong, Sangjin
2004-12-01
Newly developed resampling algorithms for particle filters suitable for real-time implementation are described and their analysis is presented. The new algorithms reduce the complexity of both hardware and DSP realization through addressing common issues such as decreasing the number of operations and memory access. Moreover, the algorithms allow for use of higher sampling frequencies by overlapping in time the resampling step with the other particle filtering steps. Since resampling is not dependent on any particular application, the analysis is appropriate for all types of particle filters that use resampling. The performance of the algorithms is evaluated on particle filters applied to bearings-only tracking and joint detection and estimation in wireless communications. We have demonstrated that the proposed algorithms reduce the complexity without performance degradation.
Efficient graph algorithms for sequential and parallel computers. Doctoral thesis
Goldberg, A.V.
1987-02-01
This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Present experimental results achieved by the implementation are presented. Present parallel symmetry-breaking techniques are the main topic of Chapter 4.
Apparatuses and Methods for Producing Runtime Architectures of Computer Program Modules
NASA Technical Reports Server (NTRS)
Abi-Antoun, Marwan Elia (Inventor); Aldrich, Jonathan Erik (Inventor)
2013-01-01
Apparatuses and methods for producing run-time architectures of computer program modules. One embodiment includes creating an abstract graph from the computer program module and from containment information corresponding to the computer program module, wherein the abstract graph has nodes including types and objects, and wherein the abstract graph relates an object to a type, and wherein for a specific object the abstract graph relates the specific object to a type containing the specific object; and creating a runtime graph from the abstract graph, wherein the runtime graph is a representation of the true runtime object graph, wherein the runtime graph represents containment information such that, for a specific object, the runtime graph relates the specific object to another object that contains the specific object.
NASA Astrophysics Data System (ADS)
Xing, Jianchao; Zhang, Jie; Zhao, Yongli; Cao, Xuping; Wang, Dajiang; Gu, Wanyi
2011-12-01
The traditional approach for inter-domain Traffic Engineering Label Switching Path (TE-LSP) computation like BRPC could provide a shortest inter-domain constrained TE-LSP, but under wavelength continuity constraint, it couldn't guarantee the success of the resources reservation for the shortest path. In this paper, a Collision-aware Backward Recursive PCE-based Computation Algorithm (CA-BRPC) in multi-domain optical networks under wavelength continuity constraint is proposed, which is implemented based on Hierarchical PCE (H-PCE) architecture, could provide an optimal inter-domain TE-LSP and avoid resources reservation conflict. Numeric results show that the CA-BRPC could reduce the blocking probability of entire network.
3D-SoftChip: A Novel Architecture for Next-Generation Adaptive Computing Systems
NASA Astrophysics Data System (ADS)
Kim, Chul; Rassau, Alex; Lachowicz, Stefan; Lee, Mike Myung-Ok; Eshraghian, Kamran
2006-12-01
This paper introduces a novel architecture for next-generation adaptive computing systems, which we term 3D-SoftChip. The 3D-SoftChip is a 3-dimensional (3D) vertically integrated adaptive computing system combining state-of-the-art processing and 3D interconnection technology. It comprises the vertical integration of two chips (a configurable array processor and an intelligent configurable switch) through an indium bump interconnection array (IBIA). The configurable array processor (CAP) is an array of heterogeneous processing elements (PEs), while the intelligent configurable switch (ICS) comprises a switch block, 32-bit dedicated RISC processor for control, on-chip program/data memory, data frame buffer, along with a direct memory access (DMA) controller. This paper introduces the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing as a next-generation computing system. The paper further describes the advanced HW/SW codesign and verification methodology, including high-level system modeling of the 3D-SoftChip using SystemC, being used to determine the optimum hardware specification in the early design stage.
A connectionist architecture for view-independent grip-aperture computation.
Prevete, Roberto; Tessitore, Giovanni; Santoro, Matteo; Catanzariti, Ezio
2008-08-15
This paper addresses the problem of extracting view-invariant visual features for the recognition of object-directed actions and introduces a computational model of how these visual features are processed in the brain. In particular, in the test-bed setting of reach-to-grasp actions, grip aperture is identified as a good candidate for inclusion into a parsimonious set of hand high-level features describing overall hand movement during reach-to-grasp actions. The computational model NeGOI (neural network architecture for measuring grip aperture in an observer-independent way) for extracting grip aperture in a view-independent fashion was developed on the basis of functional hypotheses about cortical areas that are involved in visual processing. An assumption built into NeGOI is that grip aperture can be measured from the superposition of a small number of prototypical hand shapes corresponding to predefined grip-aperture sizes. The key idea underlying the NeGOI model is to introduce view-independent units (VIP units) that are selective for prototypical hand shapes, and to integrate the output of VIP units in order to compute grip aperture. The distinguishing traits of the NEGOI architecture are discussed together with results of tests concerning its view-independence and grip-aperture recognition properties. The overall functional organization of NEGOI model is shown to be coherent with current functional models of the ventral visual stream, up to and including temporal area STS. Finally, the functional role of the NeGOI model is examined from the perspective of a biologically plausible architecture which provides a parsimonious set of high-level and view-independent visual features as input to mirror systems. PMID:18538746
One high-accuracy camera calibration algorithm based on computer vision images
NASA Astrophysics Data System (ADS)
Wang, Ying; Huang, Jianming; Wei, Xiangquan
2015-12-01
Camera calibration is the first step of computer vision and one of the most active research fields nowadays. In order to improve the measurement precision, the internal parameters of the camera should be accurately calibrated. So one high-accuracy camera calibration algorithm is proposed based on the images of planar targets or tridimensional targets. By using the algorithm, the internal parameters of the camera are calibrated based on the existing planar target at the vision-based navigation experiment. The experimental results show that the accuracy of the proposed algorithm is obviously improved compared with the conventional linear algorithm, Tsai general algorithm, and Zhang Zhengyou calibration algorithm. The algorithm proposed by the article can satisfy the need of computer vision and provide reference for precise measurement of the relative position and attitude.
Linnanto, Juha Matti; Freiberg, Arvi
2014-10-06
We have used different computational methods to study structural architecture, and light-harvesting and energy transfer properties of the photosynthetic unit of filamentous anoxygenic phototrophs. Due to the huge number of atoms in the photosynthetic unit, a combination of atomistic and coarse methods was used for electronic structure calculations. The calculations reveal that the light energy absorbed by the peripheral chlorosome antenna complex transfers efficiently via the baseplate and the core B808–866 antenna complexes to the reaction center complex, in general agreement with the present understanding of this complex system.
Algorithm development for Maxwell's equations for computational electromagnetism
NASA Technical Reports Server (NTRS)
Goorjian, Peter M.
1990-01-01
A new algorithm has been developed for solving Maxwell's equations for the electromagnetic field. It solves the equations in the time domain with central, finite differences. The time advancement is performed implicitly, using an alternating direction implicit procedure. The space discretization is performed with finite volumes, using curvilinear coordinates with electromagnetic components along those directions. Sample calculations are presented of scattering from a metal pin, a square and a circle to demonstrate the capabilities of the new algorithm.
Fast algorithm for automatically computing Strahler stream order
Lanfear, Kenneth J.
1990-01-01
An efficient algorithm was developed to determine Strahler stream order for segments of stream networks represented in a Geographic Information System (GIS). The algorithm correctly assigns Strahler stream order in topologically complex situations such as braided streams and multiple drainage outlets. Execution time varies nearly linearly with the number of stream segments in the network. This technique is expected to be particularly useful for studying the topology of dense stream networks derived from digital elevation model data.
NASA Technical Reports Server (NTRS)
Fijany, Amir; Toomarian, Benny N.
2000-01-01
There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years and this trend may also continue for some near future. However, it is a well known fact that there are major obstacles, i.e., physical limitation of feature size reduction and ever increasing cost of foundry, that would prevent the long term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dots based computing by using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general and QCA specifically, are intriguing to NASA due to their high packing density (10(exp 11) - 10(exp 12) per square cm ) and low power consumption (no transfer of current) and potentially higher radiation tolerant. Under Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planner (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA
Addendum to 'A new hybrid algorithm for computing a fast discrete Fourier transform'
NASA Technical Reports Server (NTRS)
Reed, I. S.; Truong, T. K.; Benjauthrit, B.
1981-01-01
The reported investigation represents a continuation of a study conducted by Reed and Truong (1979), who proposed a hybrid algorithm for computing the discrete Fourier transform (DFT). The proposed technique employs a Winograd-type algorithm in conjunction with the Mersenne prime-number theoretic transform to perform a DFT. The implementation of the technique involves a considerable number of additions. The new investigation shows an approach which can reduce the number of additions significantly. It is proposed to use Winograd's algorithm for computing the Mersenne prime-number theoretic transform in the transform portion of the hybrid algorithm.
Parallel algorithms and archtectures for computational structural mechanics
NASA Technical Reports Server (NTRS)
Patrick, Merrell; Ma, Shing; Mahajan, Umesh
1989-01-01
The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.
Funeriu; Lehn; Fromm; Fenske
2000-06-16
The multisubunit ligand 2 combines two complexation substructures known to undergo, with specific metal ions, distinct self-assembly processes to form a double-helical and a grid-type structure, respectively. The binding information contained in this molecular strand may be expected to generate, in a strictly predetermined and univocal fashion, two different, well-defined output inorganic architectures depending on the set of metal ions, that is, on the coordination algorithm used. Indeed, as predicted, the self-assembly of 2 with eight CuII and four CuI yields the intertwined structure D1. It results from a crossover of the two assembly subprograms and has been fully characterized by crystal structure determination. On the other hand, when the instructions of strand 2 are read out with a set of eight CuI and four MII (M = Fe, Co, Ni, Cu) ions, the architectures C1-C4, resulting from a linear combination of the two subprograms, are obtained, as indicated by the available physico-chemical and spectral data. Redox interconversion of D1 and C4 has been achieved. These results indicate that the same molecular information may yield different output structures depending on how it is processed, that is, depending on the interactional (coordination) algorithm used to read it. They have wide implications for the design and implementation of programmed chemical systems, pointing towards multiprocessing capacity, in a one code/ several outputs scheme, of potential significance for molecular computation processes and possibly even with respect to information processing in biology. PMID:10926214
NASA Astrophysics Data System (ADS)
Zhang, Leihong; Liang, Dong; Li, Bei; Kang, Yi; Pan, Zilan; Zhang, Dawei; Gao, Xiumin; Ma, Xiuhua
2016-07-01
On the basis of analyzing the cosine light field with determined analytic expression and the pseudo-inverse method, the object is illuminated by a presetting light field with a determined discrete Fourier transform measurement matrix, and the object image is reconstructed by the pseudo-inverse method. The analytic expression of the algorithm of computational ghost imaging based on discrete Fourier transform measurement matrix is deduced theoretically, and compared with the algorithm of compressive computational ghost imaging based on random measurement matrix. The reconstruction process and the reconstruction error are analyzed. On this basis, the simulation is done to verify the theoretical analysis. When the sampling measurement number is similar to the number of object pixel, the rank of discrete Fourier transform matrix is the same as the one of the random measurement matrix, the PSNR of the reconstruction image of FGI algorithm and PGI algorithm are similar, the reconstruction error of the traditional CGI algorithm is lower than that of reconstruction image based on FGI algorithm and PGI algorithm. As the decreasing of the number of sampling measurement, the PSNR of reconstruction image based on FGI algorithm decreases slowly, and the PSNR of reconstruction image based on PGI algorithm and CGI algorithm decreases sharply. The reconstruction time of FGI algorithm is lower than that of other algorithms and is not affected by the number of sampling measurement. The FGI algorithm can effectively filter out the random white noise through a low-pass filter and realize the reconstruction denoising which has a higher denoising capability than that of the CGI algorithm. The FGI algorithm can improve the reconstruction accuracy and the reconstruction speed of computational ghost imaging.
Topics in Computational Learning Theory and Graph Algorithms.
ERIC Educational Resources Information Center
Board, Raymond Acton
This thesis addresses problems from two areas of theoretical computer science. The first area is that of computational learning theory, which is the study of the phenomenon of concept learning using formal mathematical models. The goal of computational learning theory is to investigate learning in a rigorous manner through the use of techniques…
Simulation of Si:P spin-based quantum computer architecture
Chang Yiachung; Fang Angbo
2008-11-07
We present realistic simulation for single and double phosphorous donors in a silicon-based quantum computer design by solving a valley-orbit coupled effective-mass equation for describing phosphorous donors in strained silicon quantum well (QW). Using a generalized unrestricted Hartree-Fock method, we solve the two-electron effective-mass equation with quantum well confinement and realistic gate potentials. The effects of QW width, gate voltages, donor separation, and donor position shift on the lowest singlet and triplet energies and their charge distributions for a neighboring donor pair in the quantum computer(QC) architecture are analyzed. The gate tunability are defined and evaluated for a typical QC design. Estimates are obtained for the duration of spin half-swap gate operation.
Arnold, Susan F; Ramachandran, Gurumurthy
2014-01-01
This study evaluated the influence of parameter values and variances and model architecture on modeled exposures, and identified important data gaps that influence lack-of-knowledge-related uncertainty, using Consexpo 4.1 as an illustrative case study. Understanding the influential determinants in exposure estimates enables more informed and appropriate use of this model and the resulting exposure estimates. In exploring the influence of parameter placement in an algorithm and of the values and variances chosen to characterize the parameters within ConsExpo, "sensitive" and "important" parameters were identified: product amount, weight fraction, exposure duration, exposure time, and ventilation rate were deemed "important," or "always sensitive." With this awareness, exposure assessors can strategically focus on acquiring the most robust estimates for these parameters. ConsExpo relies predominantly on three algorithms to assess the default scenarios: inhalation vapors evaporation equation using the Langmuir mass transfer, the dermal instant application with diffusion through the skin, and the oral ingestion by direct uptake algorithm. These algorithms, which do not necessarily render health conservative estimates, account for 87, 89 and 59% of the inhalation, dermal and oral default scenario assessments,respectively, according them greater influence relative to the less frequently used algorithms. Default data provided in ConsExpo may be useful to initiate assessments, but are insufficient for determining exposure acceptability or setting policy, as parameters defined by highly uncertain values produce biased estimates that may not be health conservative. Furthermore, this lack-of-knowledge uncertainty makes the magnitude of this bias uncertain. Significant data gaps persist for product amount, exposure time, and exposure duration. These "important" parameters exert influence in requiring broad values and variances to account for their uncertainty. Prioritizing
Peer-to-peer architectures for exascale computing : LDRD final report.
Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.; Armstrong, Robert C.; Rudish, Donald W.
2010-09-01
The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2