Algorithms versus architectures for computational chemistry
NASA Technical Reports Server (NTRS)
Partridge, H.; Bauschlicher, C. W., Jr.
1986-01-01
The algorithms employed are computationally intensive and, as a result, increased performance (both algorithmic and architectural) is required to improve accuracy and to treat larger molecular systems. Several benchmark quantum chemistry codes are examined on a variety of architectures. While these codes are only a small portion of a typical quantum chemistry library, they illustrate many of the computationally intensive kernels and data manipulation requirements of some applications. Furthermore, understanding the performance of the existing algorithm on present and proposed supercomputers serves as a guide for future programs and algorithm development. The algorithms investigated are: (1) a sparse symmetric matrix vector product; (2) a four index integral transformation; and (3) the calculation of diatomic two electron Slater integrals. The vectorization strategies are examined for these algorithms for both the Cyber 205 and Cray XMP. In addition, multiprocessor implementations of the algorithms are looked at on the Cray XMP and on the MIT static data flow machine proposed by DENNIS.
A Simple Physical Optics Algorithm Perfect for Parallel Computing Architecture
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1994-01-01
A reflector antenna computer program based upon a simple discreet approximation of the radiation integral has proven to be extremely easy to adapt to the parallel computing architecture of the modest number of large-gain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
Design And Implementation Of A Multi-Sensor Fusion Algorithm On A Hypercube Computer Architecture
NASA Astrophysics Data System (ADS)
Glover, Charles W.
1990-03-01
A multi-sensor integration (MSI) algorithm written for sequential single processor computer architecture has been transformed into a concurrent algorithm and implemented in parallel on a multi-processor hypercube computer architecture. This paper will present the philosophy and methodologies used in the decomposition of the sequential MSI algorithm, and its transformation into a parallel MSI algorithm. The parallel MSI algorithm was implemented on a NCUBETM hypercube computer. The performance of the parallel MSI algorithm has been measured and compared against its sequential counterpart by running test case scenarios through a simulation program. The simulation program allows the user to define the trajectories of all players in the scenario, and to pick the sensor suites of the players and their operating characteristics. For example, an air-to-air engagement scenario was used as one of the test cases. In this scenario, two friend aircrafts were being attacked by six foe aircraft in a pincer maneuver. Both the friend and foe aircrafts launch missiles at several different time points in the engagement. The sensor suites on each aircraft are dual mode RADAR, dual mode IRST, and ESM sensors. The modes of the sensors are switched as needed throughout the scenario. The RADAR sensor is used only intermittently, thus most of the MSI information is obtained from passive sensing. The maneuvers in this scenario caused aircraft and missile to constantly fly in and out of sensors field-of-view (F0V). This resulted in the MSI algorithm to constantly reacquire, initiate, and delete new tracks as it tracked all objects in the scenario. The objective was to determine performance of the parallel MSI algorithm in such a complex environment, and to determine how many multi-processors (nodes) of the hypercube could be effectively used by an aircraft in such an environment. For the scenario just discussed, a 4-node hypercube was found to be the optimal size and a factor two in speedup
NASA Technical Reports Server (NTRS)
Carroll, Chester C.; Youngblood, John N.; Saha, Aindam
1987-01-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Carroll, C.C.; Youngblood, J.N.; Saha, A.
1987-12-01
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel system to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, tasks definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
Tiberghien, J.
1984-01-01
This book presents papers on supercomputers. Topics considered include decentralized computer architecture, new programming languages, data flow computers, reduction computers, parallel prefix calculations, structural and behavioral descriptions of digital systems, instruction sets, software generation, personal computing, and computer architecture education.
Feng, Lu; Fedrigo, Enrico; Béchet, Clémentine; Brunner, Elisabeth; Pirani, Werther
2012-06-01
The European Southern Observatory (ESO) is studying the next generation giant telescope, called the European Extremely Large Telescope (E-ELT). With a 42 m diameter primary mirror, it is a significant step from currently existing telescopes. Therefore, the E-ELT with its instruments poses new challenges in terms of cost and computational complexity for the control system, including its adaptive optics (AO). Since the conventional matrix-vector multiplication (MVM) method successfully used so far for AO wavefront reconstruction cannot be efficiently scaled to the size of the AO systems on the E-ELT, faster algorithms are needed. Among those recently developed wavefront reconstruction algorithms, three are studied in this paper from the point of view of design, implementation, and absolute speed on three multicore multi-CPU platforms. We focus on a single-conjugate AO system for the E-ELT. The algorithms are the MVM, the Fourier transform reconstructor (FTR), and the fractal iterative method (FRiM). This study enhances the scaling of these algorithms with an increasing number of CPUs involved in the computation. We discuss implementation strategies, depending on various CPU architecture constraints, and we present the first quantitative execution times so far at the E-ELT scale. MVM suffers from a large computational burden, making the current computing platform undersized to reach timings short enough for AO wavefront reconstruction. In our study, the FTR provides currently the fastest reconstruction. FRiM is a recently developed algorithm, and several strategies are investigated and presented here in order to implement it for real-time AO wavefront reconstruction, and to optimize its execution time. The difficulty to parallelize the algorithm in such architecture is enhanced. We also show that FRiM can provide interesting scalability using a sparse matrix approach. PMID:22695596
Feng, Lu; Fedrigo, Enrico; Béchet, Clémentine; Brunner, Elisabeth; Pirani, Werther
2012-06-01
The European Southern Observatory (ESO) is studying the next generation giant telescope, called the European Extremely Large Telescope (E-ELT). With a 42 m diameter primary mirror, it is a significant step from currently existing telescopes. Therefore, the E-ELT with its instruments poses new challenges in terms of cost and computational complexity for the control system, including its adaptive optics (AO). Since the conventional matrix-vector multiplication (MVM) method successfully used so far for AO wavefront reconstruction cannot be efficiently scaled to the size of the AO systems on the E-ELT, faster algorithms are needed. Among those recently developed wavefront reconstruction algorithms, three are studied in this paper from the point of view of design, implementation, and absolute speed on three multicore multi-CPU platforms. We focus on a single-conjugate AO system for the E-ELT. The algorithms are the MVM, the Fourier transform reconstructor (FTR), and the fractal iterative method (FRiM). This study enhances the scaling of these algorithms with an increasing number of CPUs involved in the computation. We discuss implementation strategies, depending on various CPU architecture constraints, and we present the first quantitative execution times so far at the E-ELT scale. MVM suffers from a large computational burden, making the current computing platform undersized to reach timings short enough for AO wavefront reconstruction. In our study, the FTR provides currently the fastest reconstruction. FRiM is a recently developed algorithm, and several strategies are investigated and presented here in order to implement it for real-time AO wavefront reconstruction, and to optimize its execution time. The difficulty to parallelize the algorithm in such architecture is enhanced. We also show that FRiM can provide interesting scalability using a sparse matrix approach.
McGhee, J.M.; Roberts, R.M.; Morel, J.E.
1997-06-01
A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner for scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.
Introduction to systolic algorithms and architectures
Bentley, J.L.; Kung, H.T.
1983-01-01
The authors survey the class of systolic special-purpose computer architectures and algorithms, which are particularly well-suited for implementation in very large scale integrated circuitry (VLSI). They give a brief introduction to systolic arrays for a reader with a broad technical background and some experience in using a computer, but who is not necessarily a computer scientist. In addition they briefly survey the technological advances in VLSI that led to the development of systolic algorithms and architectures. 38 references.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs mit.edu)
Architecture Adaptive Computing Environment
NASA Technical Reports Server (NTRS)
Dorband, John E.
2006-01-01
Architecture Adaptive Computing Environment (aCe) is a software system that includes a language, compiler, and run-time library for parallel computing. aCe was developed to enable programmers to write programs, more easily than was previously possible, for a variety of parallel computing architectures. Heretofore, it has been perceived to be difficult to write parallel programs for parallel computers and more difficult to port the programs to different parallel computing architectures. In contrast, aCe is supportable on all high-performance computing architectures. Currently, it is supported on LINUX clusters. aCe uses parallel programming constructs that facilitate writing of parallel programs. Such constructs were used in single-instruction/multiple-data (SIMD) programming languages of the 1980s, including Parallel Pascal, Parallel Forth, C*, *LISP, and MasPar MPL. In aCe, these constructs are extended and implemented for both SIMD and multiple- instruction/multiple-data (MIMD) architectures. Two new constructs incorporated in aCe are those of (1) scalar and virtual variables and (2) pre-computed paths. The scalar-and-virtual-variables construct increases flexibility in optimizing memory utilization in various architectures. The pre-computed-paths construct enables the compiler to pre-compute part of a communication operation once, rather than computing it every time the communication operation is performed.
NASA Astrophysics Data System (ADS)
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.
2011-07-01
We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses summed absolute difference error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation, we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized nonfull grid search CPU-based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and simplified unsymmetrical multi-hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation. We also demonstrated that for an image sequence of 720 × 480 pixels in resolution commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G
2011-07-01
In this paper we describe and evaluate a fast implementation of a classical block matching motion estimation algorithm for multiple Graphical Processing Units (GPUs) using the Compute Unified Device Architecture (CUDA) computing engine. The implemented block matching algorithm (BMA) uses summed absolute difference (SAD) error criterion and full grid search (FS) for finding optimal block displacement. In this evaluation we compared the execution time of a GPU and CPU implementation for images of various sizes, using integer and non-integer search grids.The results show that use of a GPU card can shorten computation time by a factor of 200 times for integer and 1000 times for a non-integer search grid. The additional speedup for non-integer search grid comes from the fact that GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with a number of cards is achievable.In addition we compared execution time of the proposed FS GPU implementation with two existing, highly optimized non-full grid search CPU based motion estimations methods, namely implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and Simplified Unsymmetrical multi-Hexagon search in H.264/AVC standard. In these comparisons, FS GPU implementation still showed modest improvement even though the computational complexity of FS GPU implementation is substantially higher than non-FS CPU implementation.We also demonstrated that for an image sequence of 720×480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards. PMID:22347787
Parallel algorithms for matrix computations
Plemmons, R.J.
1990-01-01
The present conference on parallel algorithms for matrix computations encompasses both shared-memory systems and distributed-memory systems, as well as combinations of the two, to provide an overall perspective on parallel algorithms for both dense and sparse matrix computations in solving systems of linear equations, dense or structured problems related to least-squares computations, eigenvalue computations, singular-value computations, and rapid elliptic solvers. Specific issues addressed include the influence of parallel and vector architectures on algorithm design, computations for distributed-memory architectures such as hypercubes, solutions for sparse symmetric positive definite linear systems, symbolic and numeric factorizations, and triangular solutions. Also addressed are reference sources for parallel and vector numerical algorithms, sources for machine architectures, and sources for programming languages.
The flight telerobotic servicer: From functional architecture to computer architecture
NASA Technical Reports Server (NTRS)
Lumia, Ronald; Fiala, John
1989-01-01
After a brief tutorial on the NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) functional architecture, the approach to its implementation is shown. First, interfaces must be defined which are capable of supporting the known algorithms. This is illustrated by considering the interfaces required for the SERVO level of the NASREM functional architecture. After interface definition, the specific computer architecture for the implementation must be determined. This choice is obviously technology dependent. An example illustrating one possible mapping of the NASREM functional architecture to a particular set of computers which implements it is shown. The result of choosing the NASREM functional architecture is that it provides a technology independent paradigm which can be mapped into a technology dependent implementation capable of evolving with technology in the laboratory and in space.
Mapping algorithms on regular parallel architectures
Lee, P.
1989-01-01
It is significant that many of time-intensive scientific algorithms are formulated as nested loops, which are inherently regularly structured. In this dissertation the relations between the mathematical structure of nested loop algorithms and the architectural capabilities required for their parallel execution are studied. The architectural model considered in depth is that of an arbitrary dimensional systolic array. The mathematical structure of the algorithm is characterized by classifying its data-dependence vectors according to the new ZERO-ONE-INFINITE property introduced. Using this classification, the first complete set of necessary and sufficient conditions for correct transformation of a nested loop algorithm onto a given systolic array of an arbitrary dimension by means of linear mappings is derived. Practical methods to derive optimal or suboptimal systolic array implementations are also provided. The techniques developed are used constructively to develop families of implementations satisfying various optimization criteria and to design programmable arrays efficiently executing classes of algorithms. In addition, a Computer-Aided Design system running on SUN workstations has been implemented to help in the design. The methodology, which deals with general algorithms, is illustrated by synthesizing linear and planar systolic array algorithms for matrix multiplication, a reindexed Warshall-Floyd transitive closure algorithm, and the longest common subsequence algorithm.
Spectral element methods - Algorithms and architectures
NASA Technical Reports Server (NTRS)
Fischer, Paul; Ronquist, Einar M.; Dewey, Daniel; Patera, Anthony T.
1988-01-01
Spectral element methods are high-order weighted residual techniques for partial differential equations that combine the geometric flexibility of finite element methods with the rapid convergence of spectral techniques. Spectral element methods are described for the simulation of incompressible fluid flows, with special emphasis on implementation of spectral element techniques on medium-grained parallel processors. Two parallel architectures are considered; the first, a commercially available message-passing hypercube system; the second, a developmental reconfigurable architecture based on Geometry-Defining Processors. High parallel efficiency is obtained in hypercube spectral element computations, indicating that load balancing and communication issues can be successfully addressed by a high-order technique/medium-grained processor algorithm-architecture coupling.
Spectral element methods: Algorithms and architectures
NASA Technical Reports Server (NTRS)
Fischer, Paul; Ronquist, Einar M.; Dewey, Daniel; Patera, Anthony T.
1988-01-01
Spectral element methods are high-order weighted residual techniques for partial differential equations that combine the geometric flexibility of finite element methods with the rapid convergence of spectral techniques. Spectral element methods are described for the simulation of incompressible fluid flows, with special emphasis on implementation of spectral element techniques on medium-grained parallel processors. Two parallel architectures are considered: the first, a commercially available message-passing hypercube system; the second, a developmental reconfigurable architecture based on Geometry-Defining Processors. High parallel efficiency is obtained in hypercube spectral element computations, indicating that load balancing and communication issues can be successfully addressed by a high-order technique/medium-grained processor algorithm-architecture coupling.
Monte Carlo simulations on SIMD computer architectures
Burmester, C.P.; Gronsky, R.; Wille, L.T.
1992-03-01
Algorithmic considerations regarding the implementation of various materials science applications of the Monte Carlo technique to single instruction multiple data (SMM) computer architectures are presented. In particular, implementation of the Ising model with nearest, next nearest, and long range screened Coulomb interactions on the SIMD architecture MasPar MP-1 (DEC mpp-12000) series of massively parallel computers is demonstrated. Methods of code development which optimize processor array use and minimize inter-processor communication are presented including lattice partitioning and the use of processor array spanning tree structures for data reduction. Both geometric and algorithmic parallel approaches are utilized. Benchmarks in terms of Monte Carlo updates per second for the MasPar architecture are presented and compared to values reported in the literature from comparable studies on other architectures.
Neural algorithms on VLSI concurrent architectures
Caviglia, D.D.; Bisio, G.M.; Parodi, G.
1988-09-01
The research concerns the study of neural algorithms for developing CAD tools with A.I. features in VLSI design activities. In this paper the focus is on optimization problems such as partitioning, placement and routing. These problems require massive computational power to be solved (NP-complete problems) and the standard approach is usually based on euristic techniques. Neural algorithms can be represented by a circuital model. This kind of representation can be easily mapped in a real circuit, which, however, features limited flexibility with respect to the variety of problems. In this sense the simulation of the neural circuit, by mapping it on a digital VLSI concurrent architecture seems to be preferrable; in addition this solution offers a wider choice with regard to algorithms characteristics (e.g. transfer curve of neural elements, reconfigurability of interconnections, etc.). The implementation with programmable components, such as transputers, allows an indirect mapping of the algorithm (one transputer for N neurons) accordingly to the dimension and the characteristics of the problem. In this way the neural algorithm described by the circuit is reduced to the algorithm that simulates the network behavior. The convergence properties of that formulation are studied with respect to the characteristics of the neural element transfer curve.
Stereoscopic depth perception for robot vision: algorithms and architectures
Safranek, R.J.; Kak, A.C.
1983-01-01
The implementation of depth perception algorithms for computer vision is considered. In automated manufacturing, depth information is vital for tasks such as path planning and 3-d scene analysis. The presentation begins with a survey of computer algorithms for stereoscopic depth perception. The emphasis is on the Marr-Poggio paradigm of human stereo vision and its computer implementation. In addition, a stereo matching algorithm based on the relaxation labelling technique is examined. A computer architecture designed to efficiently implement stereo matching algorithms, an MIMD array interfaced to a global memory, is presented. 9 references.
Parallel algorithms and architectures for the manipulator inertia matrix
Amin-Javaheri, M.
1989-01-01
Several parallel algorithms and architectures to compute the manipulator inertia matrix in real time are proposed. An O(N) and an O(log{sub 2}N) parallel algorithm based upon recursive computation of the inertial parameters of sets of composite rigid bodies are formulated. One- and two-dimensional systolic architectures are presented to implement the O(N) parallel algorithm. A cube architecture is employed to implement the diagonal element of the inertia matrix in O(log{sub 2}N) time and the upper off-diagonal elements in O(N) time. The resulting K{sub 1}O(N) + K{sub 2}O(log{sub 2}N) parallel algorithm is more efficient for a cube network implementation. All the architectural configurations are based upon a VLSI Robotics Processor exploiting fine-grain parallelism. In evaluation all the architectural configurations, significant performance parameters such as I/O time and idle time due to processor synchronization as well as CPU utilization and on-chip memory size are fully included. The O(N) and O(log{sub 2}N) parallel algorithms adhere to the precedence relationships among the processors. In order to achieve a higher speedup factor; however, parallel algorithms in conjunction with Non-Strict Computational Models are devised to relax interprocess precedence, and as a result, to decrease the effective computational delays. The effectiveness of the Non-strict Computational Algorithms is verified by computer simulations, based on a PUMA 560 robot manipulator. It is demonstrated that a combination of parallel algorithms and architectures results in a very effective approach to achieve real-time response for computing the manipulator inertia matrix.
Computing architecture for autonomous microgrids
Goldsmith, Steven Y.
2015-09-29
A computing architecture that facilitates autonomously controlling operations of a microgrid is described herein. A microgrid network includes numerous computing devices that execute intelligent agents, each of which is assigned to a particular entity (load, source, storage device, or switch) in the microgrid. The intelligent agents can execute in accordance with predefined protocols to collectively perform computations that facilitate uninterrupted control of the .
Flexible surveillance system architecture for prototyping video content analysis algorithms
NASA Astrophysics Data System (ADS)
Wijnhoven, R. G. J.; Jaspers, E. G. T.; de With, P. H. N.
2006-01-01
Many proposed video content analysis algorithms for surveillance applications are very computationally intensive, which limits the integration in a total system, running on one processing unit (e.g. PC). To build flexible prototyping systems of low cost, a distributed system with scalable processing power is therefore required. This paper discusses requirements for surveillance systems, considering two example applications. From these requirements, specifications for a prototyping architecture are derived. An implementation of the proposed architecture is presented, enabling mapping of multiple software modules onto a number of processing units (PCs). The architecture enables fast prototyping of new algorithms for complex surveillance applications without considering resource constraints.
Chang, C.Y.
1986-01-01
New results on efficient forms of decoding convolutional codes based on Viterbi and stack algorithms using systolic array architecture are presented. Some theoretical aspects of systolic arrays are also investigated. First, systolic array implementation of Viterbi algorithm is considered, and various properties of convolutional codes are derived. A technique called strongly connected trellis decoding is introduced to increase the efficient utilization of all the systolic array processors. The issues dealing with the composite branch metric generation, survivor updating, overall system architecture, throughput rate, and computations overhead ratio are also investigated. Second, the existing stack algorithm is modified and restated in a more concise version so that it can be efficiently implemented by a special type of systolic array called systolic priority queue. Three general schemes of systolic priority queue based on random access memory, shift register, and ripple register are proposed. Finally, a systematic approach is presented to design systolic arrays for certain general classes of recursively formulated algorithms.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Mapping robust parallel multigrid algorithms to scalable memory architectures
NASA Technical Reports Server (NTRS)
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than line relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. The parallel implementation of a V-cycle multiple semi-coarsened grid (MSG) algorithm or distributed-memory architectures such as the Intel iPSC/860 and Paragon computers is addressed. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. A mapping of an MSG algorithm to distributed-memory architectures that demonstrate how both levels of parallelism can be exploited is described. The results is a robust and effective multigrid algorithm for distributed-memory machines.
NASA Astrophysics Data System (ADS)
Basu, S.; Ganguly, S.; Nemani, R. R.; Mukhopadhyay, S.; Milesi, C.; Votava, P.; Michaelis, A.; Zhang, G.; Cook, B. D.; Saatchi, S. S.; Boyda, E.
2014-12-01
Accurate tree cover delineation is a useful instrument in the derivation of Above Ground Biomass (AGB) density estimates from Very High Resolution (VHR) satellite imagery data. Numerous algorithms have been designed to perform tree cover delineation in high to coarse resolution satellite imagery, but most of them do not scale to terabytes of data, typical in these VHR datasets. In this paper, we present an automated probabilistic framework for the segmentation and classification of 1-m VHR data as obtained from the National Agriculture Imagery Program (NAIP) for deriving tree cover estimates for the whole of Continental United States, using a High Performance Computing Architecture. The results from the classification and segmentation algorithms are then consolidated into a structured prediction framework using a discriminative undirected probabilistic graphical model based on Conditional Random Field (CRF), which helps in capturing the higher order contextual dependencies between neighboring pixels. Once the final probability maps are generated, the framework is updated and re-trained by incorporating expert knowledge through the relabeling of misclassified image patches. This leads to a significant improvement in the true positive rates and reduction in false positive rates. The tree cover maps were generated for the state of California, which covers a total of 11,095 NAIP tiles and spans a total geographical area of 163,696 sq. miles. Our framework produced correct detection rates of around 85% for fragmented forests and 70% for urban tree cover areas, with false positive rates lower than 3% for both regions. Comparative studies with the National Land Cover Data (NLCD) algorithm and the LiDAR high-resolution canopy height model shows the effectiveness of our algorithm in generating accurate high-resolution tree cover maps.
Optical linear algebra processors - Architectures and algorithms
NASA Technical Reports Server (NTRS)
Casasent, David
1986-01-01
Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.
Fast semivariogram computation using FPGA architectures
NASA Astrophysics Data System (ADS)
Lagadapati, Yamuna; Shirvaikar, Mukul; Dong, Xuanliang
2015-02-01
The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is O(n2). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating in the low range of power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. Anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, ostensibly based on refinements to the current modules. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from MRI scans are utilized for the experiments
Acoustic simulation in architecture with parallel algorithm
NASA Astrophysics Data System (ADS)
Li, Xiaohong; Zhang, Xinrong; Li, Dan
2004-03-01
In allusion to complexity of architecture environment and Real-time simulation of architecture acoustics, a parallel radiosity algorithm was developed. The distribution of sound energy in scene is solved with this method. And then the impulse response between sources and receivers at frequency segment, which are calculated with multi-process, are combined into whole frequency response. The numerical experiment shows that parallel arithmetic can improve the acoustic simulating efficiency of complex scene.
Parallel algorithms and architectures for very fast AI search
Gu, J.
1989-01-01
A wide range of problems in natural and artificial intelligence, computer vision, computer graphics, database engineering, operations research, symbolic logic, robot manipulation and hardware design automation are special cases of Consistent Labeling Problems (CLP). CLP has long been viewed as an efficient computational model based on a unit constraint relation containing 2N-tuples of units and labels which specifies which N-tuples of labels are compatible with which N-tuples of units. Due to high computation cost and design complexity, most currently best-known algorithms and computer architectures have usually proven infeasible for solving the consistent labeling problems. Efficiency in CLP computation during the last decade has only been improved a few times. This research presents several parallel algorithms and computer architectures for solving CLP within a parallel processing framework. For problems of practical interest, 4 to 10 orders of magnitude of efficiency improvement can be easily reached. Several simple wafer scale computer architectures are given which implement these parallel algorithms at a surprisingly low cost.
A Dualistic Model To Describe Computer Architectures
NASA Astrophysics Data System (ADS)
Nitezki, Peter; Engel, Michael
1985-07-01
The Dualistic Model for Computer Architecture Description uses a hierarchy of abstraction levels to describe a computer in arbitrary steps of refinement from the top of the user interface to the bottom of the gate level. In our Dualistic Model the description of an architecture may be divided into two major parts called "Concept" and "Realization". The Concept of an architecture on each level of the hierarchy is an Abstract Data Type that describes the functionality of the computer and an implementation of that data type relative to the data type of the next lower level of abstraction. The Realization on each level comprises a language describing the means of user interaction with the machine, and a processor interpreting this language in terms of the language of the lower level. The surface of each hierarchical level, the data type and the language express the behaviour of a ma-chine at this level, whereas the implementation and the processor describe the structure of the algorithms and the system. In this model the Principle of Operation maps the object and computational structure of the Concept onto the structures of the Realization. Describing a system in terms of the Dualistic Model is therefore a process of refinement starting at a mere description of behaviour and ending at a description of structure. This model has proven to be a very valuable tool in exploiting the parallelism in a problem and it is very transparent in discovering the points where par-allelism is lost in a special architecture. It has successfully been used in a project on a survey of Computer Architecture for Image Processing and Pattern Analysis in Germany.
Scalable computer architecture for digital vascular systems
NASA Astrophysics Data System (ADS)
Goddard, Iain; Chao, Hui; Skalabrin, Mark
1998-06-01
Digital vascular computer systems are used for radiology and fluoroscopy (R/F), angiography, and cardiac applications. In the United States alone, about 26 million procedures of these types are performed annually: about 81% R/F, 11% cardiac, and 8% angiography. Digital vascular systems have a very wide range of performance requirements, especially in terms of data rates. In addition, new features are added over time as they are shown to be clinically efficacious. Application-specific processing modes such as roadmapping, peak opacification, and bolus chasing are particular to some vascular systems. New algorithms continue to be developed and proven, such as Cox and deJager's precise registration methods for masks and live images in digital subtraction angiography. A computer architecture must have high scalability and reconfigurability to meet the needs of this modality. Ideally, the architecture could also serve as the basis for a nonvascular R/F system.
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Donaldson, Thomas; Doty, Karl; Engle, Steven W.; Larson, Robert E.; O'Reilly, John G.
1988-01-01
The results of a Phase I Small Business Innovative Research (SBIR) project performed for the NASA Langley Computational Structural Mechanics Group are described. The project resulted in the identification of a family of chordal-ring interconnection architectures with excellent potential to serve as the basis for new multimicroprocessor (MMP) computers. The paper presents examples of how computational algorithms from structural mechanics can be efficiently implemented on the chordal-ring architecture.
Gropp, William D.
2014-06-23
With the coming end of Moore's law, it has become essential to develop new algorithms and techniques that can provide the performance needed by demanding computational science applications, especially those that are part of the DOE science mission. This work was part of a multi-institution, multi-investigator project that explored several approaches to develop algorithms that would be effective at the extreme scales and with the complex processor architectures that are expected at the end of this decade. The work by this group developed new performance models that have already helped guide the development of highly scalable versions of an algebraic multigrid solver, new programming approaches designed to support numerical algorithms on heterogeneous architectures, and a new, more scalable version of conjugate gradient, an important algorithm in the solution of very large linear systems of equations.
Proceedings of the 13th annual international symposium on computer architecture
Not Available
1986-01-01
This book presents the papers given at a symposium which considered supercomputers and array processors. Topics covered at the symposium included knowledge bases, parallel architectures, artificial intelligence, VLSI, parallel algorithms, memory requirements, graph allocation, software implementation, performance analysis, parallel Prolog architectures, interconnections, Lisp machines, special purpose architectures, dataflow architectures, FFT machines, CPU architectures, matrix computation architectures, image processing, cache memory, pipeline architectures, and cache coherence.
Architectural Implications for Spatial Object Association Algorithms*
Kumar, Vijay S.; Kurc, Tahsin; Saltz, Joel; Abdulla, Ghaleb; Kohn, Scott R.; Matarazzo, Celeste
2013-01-01
Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server®, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST). PMID:25692244
Architectural Implications for Spatial Object Association Algorithms
Kumar, V S; Kurc, T; Saltz, J; Abdulla, G; Kohn, S R; Matarazzo, C
2009-01-29
Spatial object association, also referred to as cross-match of spatial datasets, is the problem of identifying and comparing objects in two or more datasets based on their positions in a common spatial coordinate system. In this work, we evaluate two crossmatch algorithms that are used for astronomical sky surveys, on the following database system architecture configurations: (1) Netezza Performance Server R, a parallel database system with active disk style processing capabilities, (2) MySQL Cluster, a high-throughput network database system, and (3) a hybrid configuration consisting of a collection of independent database system instances with data replication support. Our evaluation provides insights about how architectural characteristics of these systems affect the performance of the spatial crossmatch algorithms. We conducted our study using real use-case scenarios borrowed from a large-scale astronomy application known as the Large Synoptic Survey Telescope (LSST).
Parallel Architectures and Parallel Algorithms for Integrated Vision Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi
1989-01-01
Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems.
NASA Technical Reports Server (NTRS)
Metcalfe, A. G.; Bodenheimer, R. E.
1976-01-01
A parallel algorithm for counting the number of logic-l elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
A heterogeneous hierarchical architecture for real-time computing
Skroch, D.A.; Fornaro, R.J.
1988-12-01
The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchial Architecture for Real-Time (H{sup 2} ART) and system software for program loading and interprocessor communication.
Algorithmes et architectures pour ordinateurs quantiques supraconducteurs
NASA Astrophysics Data System (ADS)
Blais, A.
2003-09-01
Algorithms and architectures for superconducting quantum computers Since its formulation, information theory was based, implicitly, on the laws of classical physics. Such a formulation is however incomplete because it does not take into account quantum reality. During the last twenty years, expansion of theory information to include quantum effects has known growing interest. The practical realization of a system for quantum data processing system, a quantum computer, presents however many challenges. In this book, we are interested in various aspects of these challenges. We start by presenting algorithmic concepts like optimization of quantum computations and geometric quantum computation. We then consider various designs and aspects of qubits based on Josephson junctions. In particular, an original approach to the interaction between superconducting qubits is presented. This approach is very general since it can be applied to various designs of qubits. Finally, we are interested in read-out of the superconductic flux qubits. The detector suggested here has the advantage that it is possible to uncouple it from the qubit when no measurement is in progress. Depuis sa formulation, la théorie de l'information a été basée, implicitement, sur les lois de la physique classique. Une telle formulation est toutefois incomplète puisqu'elle ne tient pas compte de la réalité quantique. Au cours des vingt dernières années, l'expansion de la théorie de l'information, de façon à englober les effets purement quantiques, a connu un intérêt grandissant. La réalisation d'un système de traitement de l'information quantique, un ordinateur quantique, présente toutefois de nombreux défis. Dans cet ouvrage, on s'intéresse à différents aspects concernant ces défis. On commence par présenter des concepts algorithmiques comme l'optimisation de calculs quantiques et le calcul quantique géométrique. Par la suite, on s'intéresse à différents designs et aspects de l
A biconjugate gradient type algorithm on massively parallel architectures
NASA Technical Reports Server (NTRS)
Freund, Roland W.; Hochbruck, Marlis
1991-01-01
The biconjugate gradient (BCG) method is the natural generalization of the classical conjugate gradient algorithm for Hermitian positive definite matrices to general non-Hermitian linear systems. Unfortunately, the original BCG algorithm is susceptible to possible breakdowns and numerical instabilities. Recently, Freund and Nachtigal have proposed a novel BCG type approach, the quasi-minimal residual method (QMR), which overcomes the problems of BCG. Here, an implementation is presented of QMR based on an s-step version of the nonsymmetric look-ahead Lanczos algorithm. The main feature of the s-step Lanczos algorithm is that, in general, all inner products, except for one, can be computed in parallel at the end of each block; this is unlike the other standard Lanczos process where inner products are generated sequentially. The resulting implementation of QMR is particularly attractive on massively parallel SIMD architectures, such as the Connection Machine.
Computational Controls Workstation: Algorithms and hardware
NASA Technical Reports Server (NTRS)
Venugopal, R.; Kumar, M.
1993-01-01
The Computational Controls Workstation provides an integrated environment for the modeling, simulation, and analysis of Space Station dynamics and control. Using highly efficient computational algorithms combined with a fast parallel processing architecture, the workstation makes real-time simulation of flexible body models of the Space Station possible. A consistent, user-friendly interface and state-of-the-art post-processing options are combined with powerful analysis tools and model databases to provide users with a complete environment for Space Station dynamics and control analysis. The software tools available include a solid modeler, graphical data entry tool, O(n) algorithm-based multi-flexible body simulation, and 2D/3D post-processors. This paper describes the architecture of the workstation while a companion paper describes performance and user perspectives.
Computational Biology, Advanced Scientific Computing, and Emerging Computational Architectures
2007-06-27
This CRADA was established at the start of FY02 with $200 K from IBM and matching funds from DOE to support post-doctoral fellows in collaborative research between International Business Machines and Oak Ridge National Laboratory to explore effective use of emerging petascale computational architectures for the solution of computational biology problems. 'No cost' extensions of the CRADA were negotiated with IBM for FY03 and FY04.
Task scheduling in dataflow computer architectures
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
Dataflow computers provide a platform for the solution of a large class of computational problems, which includes digital signal processing and image processing. Many typical applications are represented by a set of tasks which can be repetitively executed in parallel as specified by an associated dataflow graph. Research in this area aims to model these architectures, develop scheduling procedures, and predict the transient and steady state performance. Researchers at NASA have created a model and developed associated software tools which are capable of analyzing a dataflow graph and predicting its runtime performance under various resource and timing constraints. These models and tools were extended and used in this work. Experiments using these tools revealed certain properties of such graphs that require further study. Specifically, the transient behavior at the beginning of the execution of a graph can have a significant effect on the steady state performance. Transformation and retiming of the application algorithm and its initial conditions can produce a different transient behavior and consequently different steady state performance. The effect of such transformations on the resource requirements or under resource constraints requires extensive study. Task scheduling to obtain maximum performance (based on user-defined criteria), or to satisfy a set of resource constraints, can also be significantly affected by a transformation of the application algorithm. Since task scheduling is performed by heuristic algorithms, further research is needed to determine if new scheduling heuristics can be developed that can exploit such transformations. This work has provided the initial development for further long-term research efforts. A simulation tool was completed to provide insight into the transient and steady state execution of a dataflow graph. A set of scheduling algorithms was completed which can operate in conjunction with the modeling and performance tools
A high performance parallel computing architecture for robust image features
NASA Astrophysics Data System (ADS)
Zhou, Renyan; Liu, Leibo; Wei, Shaojun
2014-03-01
A design of parallel architecture for image feature detection and description is proposed in this article. The major component of this architecture is a 2D cellular network composed of simple reprogrammable processors, enabling the Hessian Blob Detector and Haar Response Calculation, which are the most computing-intensive stage of the Speeded Up Robust Features (SURF) algorithm. Combining this 2D cellular network and dedicated hardware for SURF descriptors, this architecture achieves real-time image feature detection with minimal software in the host processor. A prototype FPGA implementation of the proposed architecture achieves 1318.9 GOPS general pixel processing @ 100 MHz clock and achieves up to 118 fps in VGA (640 × 480) image feature detection. The proposed architecture is stand-alone and scalable so it is easy to be migrated into VLSI implementation.
Parallel Computing Strategies for Irregular Algorithms
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)
2002-01-01
Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
Fibonacci Numbers and Computer Algorithms.
ERIC Educational Resources Information Center
Atkins, John; Geist, Robert
1987-01-01
The Fibonacci Sequence describes a vast array of phenomena from nature. Computer scientists have discovered and used many algorithms which can be classified as applications of Fibonacci's sequence. In this article, several of these applications are considered. (PK)
Computer algorithm for coding gain
NASA Technical Reports Server (NTRS)
Dodd, E. E.
1974-01-01
Development of a computer algorithm for coding gain for use in an automated communications link design system. Using an empirical formula which defines coding gain as used in space communications engineering, an algorithm is constructed on the basis of available performance data for nonsystematic convolutional encoding with soft-decision (eight-level) Viterbi decoding.
SCORPIUS algorithm benchmarks on the image understanding architecture machine
NASA Astrophysics Data System (ADS)
Bogdanowicz, Julius F.; Nash, J. Gregory; Shu, David B.
1992-04-01
Many Hughes tactical and strategic programs need high performance image processing. For example, photo-interpretation applications can require up to four orders of magnitude speedup over conventional computer architectures. Therefore, parallel processing systems are needed to help close the processing gap. Vision applications can usually be decomposed into three levels of processing called high, intermediate, and low level vision. Each processing level typically requires different types of numeric/symbolic computation, processing task granularities, and communications bandwidths. No parallel processing system is commercially available that is optimized for the entire range of computations. To meet these processing challenges, the image understanding architecture (IUA) has been developed by Hughes in collaboration with the University of Massachusetts. The IUA is a heterogeneous, hierarchical, associative parallel processor that is organized in three levels corresponding to the vision problem. Its lowest level consists of a large content addressable array parallel processor. This array of 'per pixel' bit serial processors is used for fixed point, low level numeric, and symbolic computations. The middle level is an interface communications array processor (ICAP). ICAP is an array of digital signal processing chips from TI TMS320Cx line, used for high speed number crunching. The highest level is the symbolic processing array. It is an array of general purpose microprocessors in which the artificial intelligence content of the image understanding software resides. A set of benchmarks from the DARPA/ORD sponsored SCORPIUS program were developed using the IUA. The set of algorithms included low level image processing as well as high level matching algorithms. Benchmark performance on the second generation IUA hardware is over four orders of magnitude faster than equivalent algorithms implemented on a DEC VAX 8650. The first generation hardware is operational. Development
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1991-01-01
The Theory of Intelligent Machines proposes a hierarchical organization for the functions of an autonomous robot based on the Principle of Increasing Precision With Decreasing Intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed in recent years. A computer architecture that implements the lower two levels of the intelligent machine is presented. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Details of Execution Level controllers for motion and vision systems are addressed, as well as the Petri net transducer software used to implement Coordination Level functions. Extensions to UNIX and VxWorks operating systems which enable the development of a heterogeneous, distributed application are described. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
A computer architecture for intelligent machines
NASA Technical Reports Server (NTRS)
Lefebvre, D. R.; Saridis, G. N.
1992-01-01
The theory of intelligent machines proposes a hierarchical organization for the functions of an autonomous robot based on the principle of increasing precision with decreasing intelligence. An analytic formulation of this theory using information-theoretic measures of uncertainty for each level of the intelligent machine has been developed. The authors present a computer architecture that implements the lower two levels of the intelligent machine. The architecture supports an event-driven programming paradigm that is independent of the underlying computer architecture and operating system. Execution-level controllers for motion and vision systems are briefly addressed, as well as the Petri net transducer software used to implement coordination-level functions. A case study illustrates how this computer architecture integrates real-time and higher-level control of manipulator and vision systems.
Switching from Computer to Microcomputer Architecture Education
ERIC Educational Resources Information Center
Bolanakis, Dimosthenis E.; Kotsis, Konstantinos T.; Laopoulos, Theodore
2010-01-01
In the last decades, the technological and scientific evolution of the computing discipline has been widely affecting research in software engineering education, which nowadays advocates more enlightened and liberal ideas. This article reviews cross-disciplinary research on a computer architecture class in consideration of its switching to…
Computer Electromagnetics and Supercomputer Architecture
NASA Technical Reports Server (NTRS)
Cwik, Tom
1993-01-01
The dramatic increase in performance over the last decade for microporcessor computations is compared with that for the supercomputer computations. This performance, the projected performance, and a number of other issues such as cost and the inherent pysical limitations in curent supercomputer technology have naturally led to parallel supercomputers and ensemble of interconnected microprocessors.
A VLSI architecture for simplified arithmetic Fourier transform algorithm
NASA Technical Reports Server (NTRS)
Reed, Irving S.; Shih, Ming-Tang; Truong, T. K.; Hendon, E.; Tufts, D. W.
1992-01-01
The arithmetic Fourier transform (AFT) is a number-theoretic approach to Fourier analysis which has been shown to perform competitively with the classical FFT in terms of accuracy, complexity, and speed. Theorems developed in a previous paper for the AFT algorithm are used here to derive the original AFT algorithm which Bruns found in 1903. This is shown to yield an algorithm of less complexity and of improved performance over certain recent AFT algorithms. A VLSI architecture is suggested for this simplified AFT algorithm. This architecture uses a butterfly structure which reduces the number of additions by 25 percent of that used in the direct method.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1987-01-01
The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
FFT Computation with Systolic Arrays, A New Architecture
NASA Technical Reports Server (NTRS)
Boriakoff, Valentin
1994-01-01
The use of the Cooley-Tukey algorithm for computing the l-d FFT lends itself to a particular matrix factorization which suggests direct implementation by linearly-connected systolic arrays. Here we present a new systolic architecture that embodies this algorithm. This implementation requires a smaller number of processors and a smaller number of memory cells than other recent implementations, as well as having all the advantages of systolic arrays. For the implementation of the decimation-in-frequency case, word-serial data input allows continuous real-time operation without the need of a serial-to-parallel conversion device. No control or data stream switching is necessary. Computer simulation of this architecture was done in the context of a 1024 point DFT with a fixed point processor, and CMOS processor implementation has started.
The new landscape of parallel computer architecture
NASA Astrophysics Data System (ADS)
Shalf, John
2007-07-01
The past few years has seen a sea change in computer architecture that will impact every facet of our society as every electronic device from cell phone to supercomputer will need to confront parallelism of unprecedented scale. Whereas the conventional multicore approach (2, 4, and even 8 cores) adopted by the computing industry will eventually hit a performance plateau, the highest performance per watt and per chip area is achieved using manycore technology (hundreds or even thousands of cores). However, fully unleashing the potential of the manycore approach to ensure future advances in sustained computational performance will require fundamental advances in computer architecture and programming models that are nothing short of reinventing computing. In this paper we examine the reasons behind the movement to exponentially increasing parallelism, and its ramifications for system design, applications and programming models.
Associative Algorithms for Computational Creativity
ERIC Educational Resources Information Center
Varshney, Lav R.; Wang, Jun; Varshney, Kush R.
2016-01-01
Computational creativity, the generation of new, unimagined ideas or artifacts by a machine that are deemed creative by people, can be applied in the culinary domain to create novel and flavorful dishes. In fact, we have done so successfully using a combinatorial algorithm for recipe generation combined with statistical models for recipe ranking…
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.; Som, Sukhamony
1990-01-01
The performance modeling and enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures is examined. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called ATAMM (Algorithm To Architecture Mapping Model). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Som, Sukhamoy; Stoughton, John W.; Mielke, Roland R.
1990-01-01
Performance modeling and performance enhancement for periodic execution of large-grain, decision-free algorithms in data flow architectures are discussed. Applications include real-time implementation of control and signal processing algorithms where performance is required to be highly predictable. The mapping of algorithms onto the specified class of data flow architectures is realized by a marked graph model called algorithm to architecture mapping model (ATAMM). Performance measures and bounds are established. Algorithm transformation techniques are identified for performance enhancement and reduction of resource (computing element) requirements. A systematic design procedure is described for generating operating conditions for predictable performance both with and without resource constraints. An ATAMM simulator is used to test and validate the performance prediction by the design procedure. Experiments on a three resource testbed provide verification of the ATAMM model and the design procedure.
The Fermilab Central Computing Facility architectural model
Nicholls, J.
1989-05-01
The goal of the current Central Computing Upgrade at Fermilab is to create a computing environment that maximizes total productivity, particularly for high energy physics analysis. The Computing Department and the Next Computer Acquisition Committee decided upon a model which includes five components: an interactive front end, a Large-Scale Scientific Computer (LSSC, a mainframe computing engine), a microprocessor farm system, a file server, and workstations. With the exception of the file server, all segments of this model are currently in production: a VAX/VMS Cluster interactive front end, an Amdahl VM computing engine, ACP farms, and (primarily) VMS workstations. This presentation will discuss the implementation of the Fermilab Central Computing Facility Architectural Model. Implications for Code Management in such a heterogeneous environment, including issues such as modularity and centrality, will be considered. Special emphasis will be placed on connectivity and communications between the front-end, LSSC, and workstations, as practiced at Fermilab. 2 figs.
Quantum perceptron over a field and neural network architecture selection in a quantum computer.
da Silva, Adenilton José; Ludermir, Teresa Bernarda; de Oliveira, Wilson Rosa
2016-04-01
In this work, we propose a quantum neural network named quantum perceptron over a field (QPF). Quantum computers are not yet a reality and the models and algorithms proposed in this work cannot be simulated in actual (or classical) computers. QPF is a direct generalization of a classical perceptron and solves some drawbacks found in previous models of quantum perceptrons. We also present a learning algorithm named Superposition based Architecture Learning algorithm (SAL) that optimizes the neural network weights and architectures. SAL searches for the best architecture in a finite set of neural network architectures with linear time over the number of patterns in the training set. SAL is the first learning algorithm to determine neural network architectures in polynomial time. This speedup is obtained by the use of quantum parallelism and a non-linear quantum operator. PMID:26878722
Quantum perceptron over a field and neural network architecture selection in a quantum computer.
da Silva, Adenilton José; Ludermir, Teresa Bernarda; de Oliveira, Wilson Rosa
2016-04-01
In this work, we propose a quantum neural network named quantum perceptron over a field (QPF). Quantum computers are not yet a reality and the models and algorithms proposed in this work cannot be simulated in actual (or classical) computers. QPF is a direct generalization of a classical perceptron and solves some drawbacks found in previous models of quantum perceptrons. We also present a learning algorithm named Superposition based Architecture Learning algorithm (SAL) that optimizes the neural network weights and architectures. SAL searches for the best architecture in a finite set of neural network architectures with linear time over the number of patterns in the training set. SAL is the first learning algorithm to determine neural network architectures in polynomial time. This speedup is obtained by the use of quantum parallelism and a non-linear quantum operator.
Evaluation of Visual Computer Simulator for Computer Architecture Education
ERIC Educational Resources Information Center
Imai, Yoshiro; Imai, Masatoshi; Moritoh, Yoshio
2013-01-01
This paper presents trial evaluation of a visual computer simulator in 2009-2011, which has been developed to play some roles of both instruction facility and learning tool simultaneously. And it illustrates an example of Computer Architecture education for University students and usage of e-Learning tool for Assembly Programming in order to…
Highly parallel computer architecture for robotic computation
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Anta K. (Inventor)
1991-01-01
In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
ATCA for Machines-- Advanced Telecommunications Computing Architecture
Larsen, R.S.; /SLAC
2008-04-22
The Advanced Telecommunications Computing Architecture is a new industry open standard for electronics instrument modules and shelves being evaluated for the International Linear Collider (ILC). It is the first industrial standard designed for High Availability (HA). ILC availability simulations have shown clearly that the capabilities of ATCA are needed in order to achieve acceptable integrated luminosity. The ATCA architecture looks attractive for beam instruments and detector applications as well. This paper provides an overview of ongoing R&D including application of HA principles to power electronics systems.
Computer graphics in architecture and engineering
NASA Technical Reports Server (NTRS)
Greenberg, D. P.
1975-01-01
The present status of the application of computer graphics to the building profession or architecture and its relationship to other scientific and technical areas were discussed. It was explained that, due to the fragmented nature of architecture and building activities (in contrast to the aerospace industry), a comprehensive, economic utilization of computer graphics in this area is not practical and its true potential cannot now be realized due to the present inability of architects and structural, mechanical, and site engineers to rely on a common data base. Future emphasis will therefore have to be placed on a vertical integration of the construction process and effective use of a three-dimensional data base, rather than on waiting for any technological breakthrough in interactive computing.
Parallel computers and parallel algorithms for CFD: An introduction
NASA Astrophysics Data System (ADS)
Roose, Dirk; Vandriessche, Rafael
1995-10-01
This text presents a tutorial on those aspects of parallel computing that are important for the development of efficient parallel algorithms and software for computational fluid dynamics. We first review the main architectural features of parallel computers and we briefly describe some parallel systems on the market today. We introduce some important concepts concerning the development and the performance evaluation of parallel algorithms. We discuss how work load imbalance and communication costs on distributed memory parallel computers can be minimized. We present performance results for some CFD test cases. We focus on applications using structured and block structured grids, but the concepts and techniques are also valid for unstructured grids.
Efficient Universal Computing Architectures for Decoding Neural Activity
Rapoport, Benjamin I.; Turicchia, Lorenzo; Wattanapanitch, Woradorn; Davidson, Thomas J.; Sarpeshkar, Rahul
2012-01-01
The ability to decode neural activity into meaningful control signals for prosthetic devices is critical to the development of clinically useful brain– machine interfaces (BMIs). Such systems require input from tens to hundreds of brain-implanted recording electrodes in order to deliver robust and accurate performance; in serving that primary function they should also minimize power dissipation in order to avoid damaging neural tissue; and they should transmit data wirelessly in order to minimize the risk of infection associated with chronic, transcutaneous implants. Electronic architectures for brain– machine interfaces must therefore minimize size and power consumption, while maximizing the ability to compress data to be transmitted over limited-bandwidth wireless channels. Here we present a system of extremely low computational complexity, designed for real-time decoding of neural signals, and suited for highly scalable implantable systems. Our programmable architecture is an explicit implementation of a universal computing machine emulating the dynamics of a network of integrate-and-fire neurons; it requires no arithmetic operations except for counting, and decodes neural signals using only computationally inexpensive logic operations. The simplicity of this architecture does not compromise its ability to compress raw neural data by factors greater than . We describe a set of decoding algorithms based on this computational architecture, one designed to operate within an implanted system, minimizing its power consumption and data transmission bandwidth; and a complementary set of algorithms for learning, programming the decoder, and postprocessing the decoded output, designed to operate in an external, nonimplanted unit. The implementation of the implantable portion is estimated to require fewer than 5000 operations per second. A proof-of-concept, 32-channel field-programmable gate array (FPGA) implementation of this portion is consequently energy efficient
Roadmap to the SRS computing architecture
Johnson, A.
1994-07-05
This document outlines the major steps that must be taken by the Savannah River Site (SRS) to migrate the SRS information technology (IT) environment to the new architecture described in the Savannah River Site Computing Architecture. This document proposes an IT environment that is {open_quotes}...standards-based, data-driven, and workstation-oriented, with larger systems being utilized for the delivery of needed information to users in a client-server relationship.{close_quotes} Achieving this vision will require many substantial changes in the computing applications, systems, and supporting infrastructure at the site. This document consists of a set of roadmaps which provide explanations of the necessary changes for IT at the site and describes the milestones that must be completed to finish the migration.
NASA Technical Reports Server (NTRS)
Hsia, T. C.; Lu, G. Z.; Han, W. H.
1987-01-01
In advanced robot control problems, on-line computation of inverse Jacobian solution is frequently required. Parallel processing architecture is an effective way to reduce computation time. A parallel processing architecture is developed for the inverse Jacobian (inverse differential kinematic equation) of the PUMA arm. The proposed pipeline/parallel algorithm can be inplemented on an IC chip using systolic linear arrays. This implementation requires 27 processing cells and 25 time units. Computation time is thus significantly reduced.
DFT algorithms for bit-serial GaAs array processor architectures
NASA Technical Reports Server (NTRS)
Mcmillan, Gary B.
1988-01-01
Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Adaptive DNA Computing Algorithm by Using PCR and Restriction Enzyme
NASA Astrophysics Data System (ADS)
Kon, Yuji; Yabe, Kaoru; Rajaee, Nordiana; Ono, Osamu
In this paper, we introduce an adaptive DNA computing algorithm by using polymerase chain reaction (PCR) and restriction enzyme. The adaptive algorithm is designed based on Adleman-Lipton paradigm[3] of DNA computing. In this work, however, unlike the Adleman- Lipton architecture a cutting operation has been introduced to the algorithm and the mechanism in which the molecules used by computation were feedback to the next cycle devised. Moreover, the amplification by PCR is performed in the molecule used by feedback and the difference concentration arisen in the base sequence can be used again. By this operation the molecules which serve as a solution candidate can be reduced down and the optimal solution is carried out in the shortest path problem. The validity of the proposed adaptive algorithm is considered with the logical simulation and finally we go on to propose applying adaptive algorithm to the chemical experiment which used the actual DNA molecules for solving an optimal network problem.
NASA Technical Reports Server (NTRS)
Mielke, R.; Stoughton, J.; Som, S.; Obando, R.; Malekpour, M.; Mandala, B.
1990-01-01
A functional description of the ATAMM Multicomputer Operating System is presented. ATAMM (Algorithm to Architecture Mapping Model) is a marked graph model which describes the implementation of large grained, decomposed algorithms on data flow architectures. AMOS, the ATAMM Multicomputer Operating System, is an operating system which implements the ATAMM rules. A first generation version of AMOS which was developed for the Advanced Development Module (ADM) is described. A second generation version of AMOS being developed for the Generic VHSIC Spaceborne Computer (GVSC) is also presented.
Acoustooptic linear algebra processors - Architectures, algorithms, and applications
NASA Technical Reports Server (NTRS)
Casasent, D.
1984-01-01
Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
Frontal optimization algorithms for multiprocessor computers
Sergienko, I.V.; Gulyanitskii, L.F.
1981-11-01
The authors describe one of the approaches to the construction of locally optimal optimization algorithms on multiprocessor computers. Algorithms of this type, called frontal, have been realized previously on single-processor computers, although this configuration does not fully exploit the specific features of their computational scheme. Experience with a number of practical discrete optimization problems confirms that the frontal algorithms are highly successful even with single-processor computers. 9 references.
Performance evaluation of the SX-6 vector architecture forscientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan Carter; Shalf,John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri,Jahed; Van der Wijngaart, Rob
2005-01-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor, and compares it against the cache-based IBMPower3 and Power4 superscalar architectures, across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines many low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Overall results demonstrate that the SX-6 achieves high performance on a large fraction of our application suite and often significantly outperforms the cache-based architectures. However, certain classes of applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Job Superscheduler Architecture and Performance in Computational Grid Environments
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Oliker, Leonid; Biswas, Rupak
2003-01-01
Computational grids hold great promise in utilizing geographically separated heterogeneous resources to solve large-scale complex scientific problems. However, a number of major technical hurdles, including distributed resource management and effective job scheduling, stand in the way of realizing these gains. In this paper, we propose a novel grid superscheduler architecture and three distributed job migration algorithms. We also model the critical interaction between the superscheduler and autonomous local schedulers. Extensive performance comparisons with ideal, central, and local schemes using real workloads from leading computational centers are conducted in a simulation environment. Additionally, synthetic workloads are used to perform a detailed sensitivity analysis of our superscheduler. Several key metrics demonstrate that substantial performance gains can be achieved via smart superscheduling in distributed computational grids.
Parallel computer graphics algorithms for the Connection Machine
Richardson, J.F.
1990-01-01
Many of the classes of computer graphics algorithms and polygon storage schemes can be adapted for parallel execution on various parallel architectures. The connection machine is one such architecture that should be thought of as a multiprocessor grid that can be reconfigured into standard 2-dimensional mesh and n-dimensional hypercube architectures. The classes of algorithms considered in this paper are SPLINES; POLYGON STORAGE; TRIANGULARIZATION; and SYMBOLIC INPUT. The target Connection Machine (hearafter designated as CM) for the algorithms of this paper has 8192 physical processors. Each physical processor has 8 kilobytes of local memory plus an arithmetic-logic unit. All processors can communicate with any other processor through a router. Thus this CM has a shared memory of 64 megabytes when used as a standard multiprocessor (MIMD) architecture. In addition, the CM interconnection structure can simulate a 2-dimensional mesh and n-dimensional hypercube (SIMD) architecture with the mesh being the default architecture. The front end for the CM is a Symbolics and the high level language is LISP or FORTRAN.
Frances: A Tool for Understanding Computer Architecture and Assembly Language
ERIC Educational Resources Information Center
Sondag, Tyler; Pokorny, Kian L.; Rajan, Hridesh
2012-01-01
Students in all areas of computing require knowledge of the computing device including software implementation at the machine level. Several courses in computer science curricula address these low-level details such as computer architecture and assembly languages. For such courses, there are advantages to studying real architectures instead of…
Parallel algorithms for mapping pipelined and parallel computations
NASA Technical Reports Server (NTRS)
Nicol, David M.
1988-01-01
Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm sup 3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm sup 2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
Hierarchical Poly Tree computer architectures defined by computational multidisciplinary mechanics
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug; Johnson, Keith
1989-01-01
This paper will develop an alternative computer architecture called the Poly Tree. Based on the requirements of computational mechanics and the concept of hierarchical substructuring, the paper will explore the development of problem-dependent parallel networks of processors which will enable significant, often superlinear, speed enhancements; provide a logical/efficient framework for linear/nonlinear and transient structural mechanics problems; and provide a logical framework from which to apply model reduction procedures. In addition, the paper will explore optimal processor arrangements which define the overall system granularity. Consideration will also be given to system I/O requirements.
Resource utilization model for the algorithm to architecture mapping model
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Patel, Rakesh R.
1993-01-01
The analytical model for resource utilization and the variable node time and conditional node model for the enhanced ATAMM model for a real-time data flow architecture are presented in this research. The Algorithm To Architecture Mapping Model, ATAMM, is a Petri net based graph theoretic model developed at Old Dominion University, and is capable of modeling the execution of large-grained algorithms on a real-time data flow architecture. Using the resource utilization model, the resource envelope may be obtained directly from a given graph and, consequently, the maximum number of required resources may be evaluated. The node timing diagram for one iteration period may be obtained using the analytical resource envelope. The variable node time model, which describes the change in resource requirement for the execution of an algorithm under node time variation, is useful to expand the applicability of the ATAMM model to heterogeneous architectures. The model also describes a method of detecting the presence of resource limited mode and its subsequent prevention. Graphs with conditional nodes are shown to be reduced to equivalent graphs with time varying nodes and, subsequently, may be analyzed using the variable node time model to determine resource requirements. Case studies are performed on three graphs for the illustration of applicability of the analytical theories.
Algorithms and architectures for robot vision
NASA Technical Reports Server (NTRS)
Schenker, Paul S.
1990-01-01
The scope of the current work is to develop practical sensing implementations for robots operating in complex, partially unstructured environments. A focus in this work is to develop object models and estimation techniques which are specific to requirements of robot locomotion, approach and avoidance, and grasp and manipulation. Such problems have to date received limited attention in either computer or human vision - in essence, asking not only how perception is in general modeled, but also what is the functional purpose of its underlying representations. As in the past, researchers are drawing on ideas from both the psychological and machine vision literature. Of particular interest is the development 3-D shape and motion estimates for complex objects when given only partial and uncertain information and when such information is incrementally accrued over time. Current studies consider the use of surface motion, contour, and texture information, with the longer range goal of developing a fused sensing strategy based on these sources and others.
Algorithms and architectures for high performance analysis of semantic graphs.
Hendrickson, Bruce Alan
2005-09-01
analysis. Since intelligence datasets can be extremely large, the focus of this work is on the use of parallel computers. We have been working to develop scalable parallel algorithms that will be at the core of a semantic graph analysis infrastructure. Our work has involved two different thrusts, corresponding to two different computer architectures. The first architecture of interest is distributed memory, message passing computers. These machines are ubiquitous and affordable, but they are challenging targets for graph algorithms. Much of our distributed-memory work to date has been collaborative with researchers at Lawrence Livermore National Laboratory and has focused on finding short paths on distributed memory parallel machines. Our implementation on 32K processors of BlueGene/Light finds shortest paths between two specified vertices in just over a second for random graphs with 4 billion vertices.
A High Performance COTS Based Computer Architecture
NASA Astrophysics Data System (ADS)
Patte, Mathieu; Grimoldi, Raoul; Trautner, Roland
2014-08-01
Using Commercial Off The Shelf (COTS) electronic components for space applications is a long standing idea. Indeed the difference in processing performance and energy efficiency between radiation hardened components and COTS components is so important that COTS components are very attractive for use in mass and power constrained systems. However using COTS components in space is not straightforward as one must account with the effects of the space environment on the COTS components behavior. In the frame of the ESA funded activity called High Performance COTS Based Computer, Airbus Defense and Space and its subcontractor OHB CGS have developed and prototyped a versatile COTS based architecture for high performance processing. The rest of the paper is organized as follows: in a first section we will start by recapitulating the interests and constraints of using COTS components for space applications; then we will briefly describe existing fault mitigation architectures and present our solution for fault mitigation based on a component called the SmartIO; in the last part of the paper we will describe the prototyping activities executed during the HiP CBC project.
QPSO-based adaptive DNA computing algorithm.
Karakose, Mehmet; Cigdem, Ugur
2013-01-01
DNA (deoxyribonucleic acid) computing that is a new computation model based on DNA molecules for information storage has been increasingly used for optimization and data analysis in recent years. However, DNA computing algorithm has some limitations in terms of convergence speed, adaptability, and effectiveness. In this paper, a new approach for improvement of DNA computing is proposed. This new approach aims to perform DNA computing algorithm with adaptive parameters towards the desired goal using quantum-behaved particle swarm optimization (QPSO). Some contributions provided by the proposed QPSO based on adaptive DNA computing algorithm are as follows: (1) parameters of population size, crossover rate, maximum number of operations, enzyme and virus mutation rate, and fitness function of DNA computing algorithm are simultaneously tuned for adaptive process, (2) adaptive algorithm is performed using QPSO algorithm for goal-driven progress, faster operation, and flexibility in data, and (3) numerical realization of DNA computing algorithm with proposed approach is implemented in system identification. Two experiments with different systems were carried out to evaluate the performance of the proposed approach with comparative results. Experimental results obtained with Matlab and FPGA demonstrate ability to provide effective optimization, considerable convergence speed, and high accuracy according to DNA computing algorithm.
Parallel language constructs for tensor product computations on loosely coupled architectures
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1989-01-01
Distributed memory architectures offer high levels of performance and flexibility, but have proven awkard to program. Current languages for nonshared memory architectures provide a relatively low level programming environment, and are poorly suited to modular programming, and to the construction of libraries. A set of language primitives designed to allow the specification of parallel numerical algorithms at a higher level is described. Tensor product array computations are focused on along with a simple but important class of numerical algorithms. The problem of programming 1-D kernal routines is focused on first, such as parallel tridiagonal solvers, and then how such parallel kernels can be combined to form parallel tensor product algorithms is examined.
A high throughput architecture for a low complexity soft-output demapping algorithm
NASA Astrophysics Data System (ADS)
Ali, I.; Wasenmüller, U.; Wehn, N.
2015-11-01
Iterative channel decoders such as Turbo-Code and LDPC decoders show exceptional performance and therefore they are a part of many wireless communication receivers nowadays. These decoders require a soft input, i.e., the logarithmic likelihood ratio (LLR) of the received bits with a typical quantization of 4 to 6 bits. For computing the LLR values from a received complex symbol, a soft demapper is employed in the receiver. The implementation cost of traditional soft-output demapping methods is relatively large in high order modulation systems, and therefore low complexity demapping algorithms are indispensable in low power receivers. In the presence of multiple wireless communication standards where each standard defines multiple modulation schemes, there is a need to have an efficient demapper architecture covering all the flexibility requirements of these standards. Another challenge associated with hardware implementation of the demapper is to achieve a very high throughput in double iterative systems, for instance, MIMO and Code-Aided Synchronization. In this paper, we present a comprehensive communication and hardware performance evaluation of low complexity soft-output demapping algorithms to select the best algorithm for implementation. The main goal of this work is to design a high throughput, flexible, and area efficient architecture. We describe architectures to execute the investigated algorithms. We implement these architectures on a FPGA device to evaluate their hardware performance. The work has resulted in a hardware architecture based on the figured out best low complexity algorithm delivering a high throughput of 166 Msymbols/second for Gray mapped 16-QAM modulation on Virtex-5. This efficient architecture occupies only 127 slice registers, 248 slice LUTs and 2 DSP48Es.
Algorithmic Mechanism Design of Evolutionary Computation
Pei, Yan
2015-01-01
We consider algorithmic design, enhancement, and improvement of evolutionary computation as a mechanism design problem. All individuals or several groups of individuals can be considered as self-interested agents. The individuals in evolutionary computation can manipulate parameter settings and operations by satisfying their own preferences, which are defined by an evolutionary computation algorithm designer, rather than by following a fixed algorithm rule. Evolutionary computation algorithm designers or self-adaptive methods should construct proper rules and mechanisms for all agents (individuals) to conduct their evolution behaviour correctly in order to definitely achieve the desired and preset objective(s). As a case study, we propose a formal framework on parameter setting, strategy selection, and algorithmic design of evolutionary computation by considering the Nash strategy equilibrium of a mechanism design in the search process. The evaluation results present the efficiency of the framework. This primary principle can be implemented in any evolutionary computation algorithm that needs to consider strategy selection issues in its optimization process. The final objective of our work is to solve evolutionary computation design as an algorithmic mechanism design problem and establish its fundamental aspect by taking this perspective. This paper is the first step towards achieving this objective by implementing a strategy equilibrium solution (such as Nash equilibrium) in evolutionary computation algorithm. PMID:26257777
A biomimetic adaptive algorithm and low-power architecture for implantable neural decoders.
Rapoport, Benjamin I; Wattanapanitch, Woradorn; Penagos, Hector L; Musallam, Sam; Andersen, Richard A; Sarpeshkar, Rahul
2009-01-01
Algorithmically and energetically efficient computational architectures that operate in real time are essential for clinically useful neural prosthetic devices. Such devices decode raw neural data to obtain direct control signals for external devices. They can also perform data compression and vastly reduce the bandwidth and consequently power expended in wireless transmission of raw data from implantable brain-machine interfaces. We describe a biomimetic algorithm and micropower analog circuit architecture for decoding neural cell ensemble signals. The decoding algorithm implements a continuous-time artificial neural network, using a bank of adaptive linear filters with kernels that emulate synaptic dynamics. The filters transform neural signal inputs into control-parameter outputs, and can be tuned automatically in an on-line learning process. We provide experimental validation of our system using neural data from thalamic head-direction cells in an awake behaving rat.
A Biomimetic Adaptive Algorithm and Low-Power Architecture for Implantable Neural Decoders
Rapoport, Benjamin I.; Wattanapanitch, Woradorn; Penagos, Hector L.; Musallam, Sam; Andersen, Richard A.; Sarpeshkar, Rahul
2010-01-01
Algorithmically and energetically efficient computational architectures that operate in real time are essential for clinically useful neural prosthetic devices. Such devices decode raw neural data to obtain direct control signals for external devices. They can also perform data compression and vastly reduce the bandwidth and consequently power expended in wireless transmission of raw data from implantable brain-machine interfaces. We describe a biomimetic algorithm and micropower analog circuit architecture for decoding neural cell ensemble signals. The decoding algorithm implements a continuous-time artificial neural network, using a bank of adaptive linear filters with kernels that emulate synaptic dynamics. The filters transform neural signal inputs into control-parameter outputs, and can be tuned automatically in an on-line learning process. We provide experimental validation of our system using neural data from thalamic head-direction cells in an awake behaving rat. PMID:19964345
NASA Technical Reports Server (NTRS)
Weeks, Cindy Lou
1986-01-01
Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.
Advanced computer architecture specification for automated weld systems
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1994-01-01
This report describes the requirements for an advanced automated weld system and the associated computer architecture, and defines the overall system specification from a broad perspective. According to the requirements of welding procedures as they relate to an integrated multiaxis motion control and sensor architecture, the computer system requirements are developed based on a proven multiple-processor architecture with an expandable, distributed-memory, single global bus architecture, containing individual processors which are assigned to specific tasks that support sensor or control processes. The specified architecture is sufficiently flexible to integrate previously developed equipment, be upgradable and allow on-site modifications.
Monte Carlo simulations on SIMD computer architectures. [Single instruction multiple data (SIMD)
Burmester, C.P.; Gronsky, R. ); Wille, L.T. . Dept. of Physics)
1992-03-01
Algorithmic considerations regarding the implementation of various materials science applications of the Monte Carlo technique to single instruction multiple data (SMM) computer architectures are presented. In particular, implementation of the Ising model with nearest, next nearest, and long range screened Coulomb interactions on the SIMD architecture MasPar MP-1 (DEC mpp-12000) series of massively parallel computers is demonstrated. Methods of code development which optimize processor array use and minimize inter-processor communication are presented including lattice partitioning and the use of processor array spanning tree structures for data reduction. Both geometric and algorithmic parallel approaches are utilized. Benchmarks in terms of Monte Carlo updates per second for the MasPar architecture are presented and compared to values reported in the literature from comparable studies on other architectures.
From Point Clouds to Architectural Models: Algorithms for Shape Reconstruction
NASA Astrophysics Data System (ADS)
Canciani, M.; Falcolini, C.; Saccone, M.; Spadafora, G.
2013-02-01
The use of terrestrial laser scanners in architectural survey applications has become more and more common. Row data complexity, as given by scanner restitution, leads to several problems about design and 3D-modelling starting from Point Clouds. In this context we present a study on architectural sections and mathematical algorithms for their shape reconstruction, according to known or definite geometrical rules, focusing on shapes of different complexity. Each step of the semi-automatic algorithm has been developed using Mathematica software and CAD, integrating both programs in order to reconstruct a geometrical CAD model of the object. Our study is motivated by the fact that, for architectural survey, most of three dimensional modelling procedures concerning point clouds produce superabundant, but often unnecessary, information and are also very expensive in terms of cpu time using more and more sophisticated hardware and software. On the contrary, it's important to simplify/decimate the point cloud in order to recognize a particular form out of some definite geometric/architectonic shapes. Such a process consists of several steps: first the definition of plane sections and characterization of their architecture; secondly the construction of a continuous plane curve depending on some parameters. In the third step we allow the selection on the curve of some nodal points with given specific characteristics (symmetry, tangency conditions, shadowing exclusion, corners, … ). The fourth and last step is the construction of a best shape defined by the comparison with an abacus of known geometrical elements, such as moulding profiles, leading to a precise architectonical section. The algorithms have been developed and tested in very different situations and are presented in a case study of complex geometries such as some mouldings profiles in the Church of San Carlo alle Quattro Fontane.
Compiling quantum algorithms for architectures with multi-qubit gates
NASA Astrophysics Data System (ADS)
Martinez, Esteban A.; Monz, Thomas; Nigg, Daniel; Schindler, Philipp; Blatt, Rainer
2016-06-01
In recent years, small-scale quantum information processors have been realized in multiple physical architectures. These systems provide a universal set of gates that allow one to implement any given unitary operation. The decomposition of a particular algorithm into a sequence of these available gates is not unique. Thus, the fidelity of the implementation of an algorithm can be increased by choosing an optimized decomposition into available gates. Here, we present a method to find such a decomposition, where a small-scale ion trap quantum information processor is used as an example. We demonstrate a numerical optimization protocol that minimizes the number of required multi-qubit entangling gates by design. Furthermore, we adapt the method for state preparation, and quantum algorithms including in-sequence measurements.
A Computational Fluid Dynamics Algorithm on a Massively Parallel Computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The discipline of computational fluid dynamics is demanding ever-increasing computational power to deal with complex fluid flow problems. We investigate the performance of a finite-difference computational fluid dynamics algorithm on a massively parallel computer, the Connection Machine. Of special interest is an implicit time-stepping algorithm; to obtain maximum performance from the Connection Machine, it is necessary to use a nonstandard algorithm to solve the linear systems that arise in the implicit algorithm. We find that the Connection Machine ran achieve very high computation rates on both explicit and implicit algorithms. The performance of the Connection Machine puts it in the same class as today's most powerful conventional supercomputers.
Data-parallel algorithms for image computing
NASA Astrophysics Data System (ADS)
Carlotto, Mark J.
1990-11-01
Data-parallel algorithms for image computing on the Connection Machine are described. After a brief review of some basic programming concepts in *Lip, a parallel extension of Common Lisp, data-parallel programming paradigms based on a local (diffusion-like) model of computation, the scan model of computation, a general interprocessor communications model, and a region-based model are introduced. Algorithms for connected component labeling, distance transformation, Voronoi diagrams, finding minimum cost paths, local means, shape-from-shading, hidden surface calculations, affine transformation, oblique parallel projection, and spatial operations over regions are presented. An new algorithm for interpolating irregularly spaced data via Voronoi diagrams is also described.
Rapid indirect trajectory optimization on highly parallel computing architectures
NASA Astrophysics Data System (ADS)
Antony, Thomas
Trajectory optimization is a field which can benefit greatly from the advantages offered by parallel computing. The current state-of-the-art in trajectory optimization focuses on the use of direct optimization methods, such as the pseudo-spectral method. These methods are favored due to their ease of implementation and large convergence regions while indirect methods have largely been ignored in the literature in the past decade except for specific applications in astrodynamics. It has been shown that the shortcomings conventionally associated with indirect methods can be overcome by the use of a continuation method in which complex trajectory solutions are obtained by solving a sequence of progressively difficult optimization problems. High performance computing hardware is trending towards more parallel architectures as opposed to powerful single-core processors. Graphics Processing Units (GPU), which were originally developed for 3D graphics rendering have gained popularity in the past decade as high-performance, programmable parallel processors. The Compute Unified Device Architecture (CUDA) framework, a parallel computing architecture and programming model developed by NVIDIA, is one of the most widely used platforms in GPU computing. GPUs have been applied to a wide range of fields that require the solution of complex, computationally demanding problems. A GPU-accelerated indirect trajectory optimization methodology which uses the multiple shooting method and continuation is developed using the CUDA platform. The various algorithmic optimizations used to exploit the parallelism inherent in the indirect shooting method are described. The resulting rapid optimal control framework enables the construction of high quality optimal trajectories that satisfy problem-specific constraints and fully satisfy the necessary conditions of optimality. The benefits of the framework are highlighted by construction of maximum terminal velocity trajectories for a hypothetical
Outline of a novel architecture for cortical computation.
Majumdar, Kaushik
2008-03-01
In this paper a novel architecture for cortical computation has been proposed. This architecture is composed of computing paths consisting of neurons and synapses. These paths have been decomposed into lateral, longitudinal and vertical components. Cortical computation has then been decomposed into lateral computation (LaC), longitudinal computation (LoC) and vertical computation (VeC). It has been shown that various loop structures in the cortical circuit play important roles in cortical computation as well as in memory storage and retrieval, keeping in conformity with the molecular basis of short and long term memory. A new learning scheme for the brain has also been proposed and how it is implemented within the proposed architecture has been explained. A few mathematical results about the architecture have been proposed, some of which are without proof.
Outline of a novel architecture for cortical computation
2007-01-01
In this paper a novel architecture for cortical computation has been proposed. This architecture is composed of computing paths consisting of neurons and synapses. These paths have been decomposed into lateral, longitudinal and vertical components. Cortical computation has then been decomposed into lateral computation (LaC), longitudinal computation (LoC) and vertical computation (VeC). It has been shown that various loop structures in the cortical circuit play important roles in cortical computation as well as in memory storage and retrieval, keeping in conformity with the molecular basis of short and long term memory. A new learning scheme for the brain has also been proposed and how it is implemented within the proposed architecture has been explained. A few mathematical results about the architecture have been proposed, some of which are without proof. PMID:19003474
Modular architecture for high performance implementation of the FFT algorithm
Spiecha, K. ); Jarocki, R. )
1990-12-01
This paper presents a new VLSI-oriented architecture to compute discrete Fourier transform. It consists of a homogeneous structure of processing elements. The structure has a performance equal to 1/{ital t} transforms per second, where {ital t} is the time needed for the execution of a single butterfly computation or the time needed for the collection of a complete vector of samples, whichever occurs to be longer. Although the system is not optimal (it achieves {ital O(N}{sup 3} log{sup 4} {ital N)} area time{sup 2} performance), the architecture is modular and makes it possible to design a system which performs FFT of any size without any extra circuitry. Moreover, the system can provide a built-in self-test and self-restructuring. The system consists of only one type of integrated circuit, its structure being irrespective of the transform size, which considerably reduces the cost of implementation.
Performance Analysis of Cloud Computing Architectures Using Discrete Event Simulation
NASA Technical Reports Server (NTRS)
Stocker, John C.; Golomb, Andrew M.
2011-01-01
Cloud computing offers the economic benefit of on-demand resource allocation to meet changing enterprise computing needs. However, the flexibility of cloud computing is disadvantaged when compared to traditional hosting in providing predictable application and service performance. Cloud computing relies on resource scheduling in a virtualized network-centric server environment, which makes static performance analysis infeasible. We developed a discrete event simulation model to evaluate the overall effectiveness of organizations in executing their workflow in traditional and cloud computing architectures. The two part model framework characterizes both the demand using a probability distribution for each type of service request as well as enterprise computing resource constraints. Our simulations provide quantitative analysis to design and provision computing architectures that maximize overall mission effectiveness. We share our analysis of key resource constraints in cloud computing architectures and findings on the appropriateness of cloud computing in various applications.
Petaflops Computing: The Key Algorithmic Challenges
NASA Technical Reports Server (NTRS)
Bailey, David H.; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
The prospect of petaflops-class computers brings to the fore some important algorithmic issues that have been considered in the high performance computing community for several years. Key among them are (1) concurrency (whether the fundamental concurrency of an algorithm is sufficient to keep thousands of processors productively busy); (2) data locality; (3) latency tolerance; and (4) memory and operation count scaling. This introductory presentation will give an overview of these issues.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Algorithms for computing the multivariable stability margin
NASA Technical Reports Server (NTRS)
Tekawy, Jonathan A.; Safonov, Michael G.; Chiang, Richard Y.
1989-01-01
Stability margin for multiloop flight control systems has become a critical issue, especially in highly maneuverable aircraft designs where there are inherent strong cross-couplings between the various feedback control loops. To cope with this issue, we have developed computer algorithms based on non-differentiable optimization theory. These algorithms have been developed for computing the Multivariable Stability Margin (MSM). The MSM of a dynamical system is the size of the smallest structured perturbation in component dynamics that will destabilize the system. These algorithms have been coded and appear to be reliable. As illustrated by examples, they provide the basis for evaluating the robustness and performance of flight control systems.
Pipeline and parallel architectures for computer communication systems
Reddi, A.V.
1983-01-01
Various existing communication precessor systems (CPSS) at different nodes in computer communication systems (CCSS) are reviewed for distributed processing systems. To meet the increasing load of messages, pipeline and parallel architectures are suggested in CPSS. Finally, pipeline, array, multi and multiple-processor architectures and their advantages in CPSS for CCSS are presented and analysed, and their performances are compared with the performance of uniprocessor architecture. 19 references.
Adaptive computation algorithm for RBF neural network.
Han, Hong-Gui; Qiao, Jun-Fei
2012-02-01
A novel learning algorithm is proposed for nonlinear modelling and identification using radial basis function neural networks. The proposed method simplifies neural network training through the use of an adaptive computation algorithm (ACA). In addition, the convergence of the ACA is analyzed by the Lyapunov criterion. The proposed algorithm offers two important advantages. First, the model performance can be significantly improved through ACA, and the modelling error is uniformly ultimately bounded. Secondly, the proposed ACA can reduce computational cost and accelerate the training speed. The proposed method is then employed to model classical nonlinear system with limit cycle and to identify nonlinear dynamic system, exhibiting the effectiveness of the proposed algorithm. Computational complexity analysis and simulation results demonstrate its effectiveness.
Parallel radiation transport algorithms and associated architectural requirements
Morel, J. E.; Baker, R. S.; Warsa, J. S.
2004-01-01
The radiation transport equation is a seven-dimensional equation that can be extremely expensive to solve. In general, transport can be expected to completely dominate the memory and CPU time requirements for the ASCI codes. Both traditional iterative transport solution methods and modern Krylov-subspace solution methods require the inversion of a large number of block lower-diagonal matrices. While such inversions are easily done in serial, a high level of sophistication is needed for implementations on massively parallel platforms. Rectangular-mesh methods are well-established and generally quite efficient but unstructured-mesh methods remain a research topic. Nonetheless, considerable progress has been made in unstructured-mesh methods over the last several years. In general, the efficiency of transport solution algorithms are quite sensitive to communication latencies and bandwidth, but there are other significant considerations as well. Some new parallel algorithms have recently been defined that may be significantly better than existing methods for time-dependent problems, but will be significantly less effective for steady-state problems in some circumstances. Transport methods would benefit from a machine architecture with low latencies, high bandwidth, and on the order of one thousand very fast, large-memory processors, as opposed to an architecture that consists of a very large number of slower processors with less memory. In addition, a lightweight operating system is highly desirable.
Biomimetic design processes in architecture: morphogenetic and evolutionary computational design.
Menges, Achim
2012-03-01
Design computation has profound impact on architectural design methods. This paper explains how computational design enables the development of biomimetic design processes specific to architecture, and how they need to be significantly different from established biomimetic processes in engineering disciplines. The paper first explains the fundamental difference between computer-aided and computational design in architecture, as the understanding of this distinction is of critical importance for the research presented. Thereafter, the conceptual relation and possible transfer of principles from natural morphogenesis to design computation are introduced and the related developments of generative, feature-based, constraint-based, process-based and feedback-based computational design methods are presented. This morphogenetic design research is then related to exploratory evolutionary computation, followed by the presentation of two case studies focusing on the exemplary development of spatial envelope morphologies and urban block morphologies.
Advanced algorithm for orbit computation
NASA Technical Reports Server (NTRS)
Szenbehely, V.
1983-01-01
Computational and analytical techniques which simplify the solution of complex problems in orbit mechanics, Astrodynamics and Celestial Mechanics were developed. The major tool of the simplification is the substitution of transformations in place of numerical or analytical integrations. In this way the rather complicated equations of orbit mechanics might sometimes be reduced to linear equations representing harmonic oscillators with constant coefficients.
Resource efficient hardware architecture for fast computation of running max/min filters.
Torres-Huitzil, Cesar
2013-01-01
Running max/min filters on rectangular kernels are widely used in many digital signal and image processing applications. Filtering with a k × k kernel requires of k(2) - 1 comparisons per sample for a direct implementation; thus, performance scales expensively with the kernel size k. Faster computations can be achieved by kernel decomposition and using constant time one-dimensional algorithms on custom hardware. This paper presents a hardware architecture for real-time computation of running max/min filters based on the van Herk/Gil-Werman (HGW) algorithm. The proposed architecture design uses less computation and memory resources than previously reported architectures when targeted to Field Programmable Gate Array (FPGA) devices. Implementation results show that the architecture is able to compute max/min filters, on 1024 × 1024 images with up to 255 × 255 kernels, in around 8.4 milliseconds, 120 frames per second, at a clock frequency of 250 MHz. The implementation is highly scalable for the kernel size with good performance/area tradeoff suitable for embedded applications. The applicability of the architecture is shown for local adaptive image thresholding. PMID:24288456
Fast diffraction computation algorithms based on FFT
NASA Astrophysics Data System (ADS)
Logofatu, Petre Catalin; Nascov, Victor; Apostol, Dan
2010-11-01
The discovery of the Fast Fourier transform (FFT) algorithm by Cooley and Tukey meant for diffraction computation what the invention of computers meant for computation in general. The computation time reduction is more significant for large input data, but generally FFT reduces the computation time with several orders of magnitude. This was the beginning of an entire revolution in optical signal processing and resulted in an abundance of fast algorithms for diffraction computation in a variety of situations. The property that allowed the creation of these fast algorithms is that, as it turns out, most diffraction formulae contain at their core one or more Fourier transforms which may be rapidly calculated using the FFT. The key in discovering a new fast algorithm is to reformulate the diffraction formulae so that to identify and isolate the Fourier transforms it contains. In this way, the fast scaled transformation, the fast Fresnel transformation and the fast Rayleigh-Sommerfeld transform were designed. Remarkable improvements were the generalization of the DFT to scaled DFT which allowed freedom to choose the dimensions of the output window for the Fraunhofer-Fourier and Fresnel diffraction, the mathematical concept of linearized convolution which thwarts the circular character of the discrete Fourier transform and allows the use of the FFT, and last but not least the linearized discrete scaled convolution, a new concept of which we claim priority.
Parallel algorithms for computing linked list prefix
Han, Y. )
1989-06-01
Given a linked list chi/sub 1/, chi/sub 2/, ....chi/sub n/ with chi/sub i/ following chi/sub i-1/ in the list and an associative operation O, the linked list prefix problem is to compute all prefixes O/sup j//sub i=1/chi/sub 1/, j=1,2,...,n. In this paper the authors study the linked list prefix problem on parallel computation models. A deterministic algorithm for computing a linked list prefix on a completely connected parallel computation model is obtained by applying vector balancing techniques. The time complexity of the algorithm is O(n/rho + rho log rho), where n is the number of elements in the linked list and rho is the number of processors used. Therefore their algorithm is optimal when n {ge}rho/sup 2/logrho. A PRAM linked list prefix algorithm is also presented. This PRAM algorithm has time complexity O(n/rho + log rho) with small multiplicative constant. It is optimal when n {ge}rho log rho.
Practical Algorithm For Computing The 2-D Arithmetic Fourier Transform
NASA Astrophysics Data System (ADS)
Reed, Irving S.; Choi, Y. Y.; Yu, Xiaoli
1989-05-01
Recently, Tufts and Sadasiv [10] exposed a method for computing the coefficients of a Fourier series of a periodic function using the Mobius inversion of series. They called this method of analysis the Arithmetic Fourier Transform(AFT). The advantage of the AFT over the FN 1' is that this method of Fourier analysis needs only addition operations except for multiplications by scale factors at one stage of the computation. The disadvantage of the AFT as they expressed it originally is that it could be used effectively only to compute finite Fourier coefficients of a real even function. To remedy this the AFT developed in [10] is extended in [11] to compute the Fourier coefficients of both the even and odd components of a periodic function. In this paper, the improved AFT [11] is extended to a two-dimensional(2-D) Arithmetic Fourier Transform for calculating the Fourier Transform of two-dimensional discrete signals. This new algorithm is based on both the number-theoretic method of Mobius inversion of double series and the complex conjugate property of Fourier coefficients. The advantage of this algorithm over the conventional 2-D FFT is that the corner-turning problem needed in a conventional 2-D Discrete Fourier Transform(DFT) can be avoided. Therefore, this new 2-D algorithm is readily suitable for VLSI implementation as a parallel architecture. Comparing the operations of 2-D AFT of a MxM 2-D data array with the conventional 2-D FFT, the number of multiplications is significantly reduced from (2log2M)M2 to (9/4)M2. Hence, this new algorithm is faster than the FFT algorithm. Finally, two simulation results of this new 2-D AFT algorithm for 2-D artificial and real images are given in this paper.
A new hardware-efficient algorithm and reconfigurable architecture for image contrast enhancement.
Huang, Shih-Chia; Chen, Wen-Chieh
2014-10-01
Contrast enhancement is crucial when generating high quality images for image processing applications, such as digital image or video photography, liquid crystal display processing, and medical image analysis. In order to achieve real-time performance for high-definition video applications, it is necessary to design efficient contrast enhancement hardware architecture to meet the needs of real-time processing. In this paper, we propose a novel hardware-oriented contrast enhancement algorithm which can be implemented effectively for hardware design. In order to be considered for hardware implementation, approximation techniques are proposed to reduce these complex computations during performance of the contrast enhancement algorithm. The proposed hardware-oriented contrast enhancement algorithm achieves good image quality by measuring the results of qualitative and quantitative analyzes. To decrease hardware cost and improve hardware utilization for real-time performance, a reduction in circuit area is proposed through use of parameter-controlled reconfigurable architecture. The experiment results show that the proposed hardware-oriented contrast enhancement algorithm can provide an average frame rate of 48.23 frames/s at high definition resolution 1920 × 1080.
A computational architecture for social agents
Bond, A.H.
1996-12-31
This article describes a new class of information-processing models for social agents. They axe derived from primate brain architecture, the processing in brain regions, the interactions among brain regions, and the social behavior of primates. In another paper, we have reviewed the neuroanatomical connections and functional involvements of cortical regions. We reviewed the evidence for a hierarchical architecture in the primate brain. By examining neuroanatomical evidence for connections among neural areas, we were able to establish anatomical regions and connections. We then examined evidence for specific functional involvements of the different neural axeas and found some support for hierarchical functioning, not only for the perception hierarchies but also for the planning and action hierarchy in the frontal lobes.
Griffiths, Thomas L; Lieder, Falk; Goodman, Noah D
2015-04-01
Marr's levels of analysis-computational, algorithmic, and implementation-have served cognitive science well over the last 30 years. But the recent increase in the popularity of the computational level raises a new challenge: How do we begin to relate models at different levels of analysis? We propose that it is possible to define levels of analysis that lie between the computational and the algorithmic, providing a way to build a bridge between computational- and algorithmic-level models. The key idea is to push the notion of rationality, often used in defining computational-level models, deeper toward the algorithmic level. We offer a simple recipe for reverse-engineering the mind's cognitive strategies by deriving optimal algorithms for a series of increasingly more realistic abstract computational architectures, which we call "resource-rational analysis."
A Simple Physical Optics Algorithm Perfect for Parallel Computing
NASA Technical Reports Server (NTRS)
Imbriale, W. A.; Cwik, T.
1993-01-01
One of the simplest reflector antenna computer programs is based upon a discrete approximation of the radiation integral. This calculation replaces the actual reflector surface with a triangular facet representation so that the reflector resembles a geodesic dome. The Physical Optics (PO) current is assumed to be constant in magnitude and phase over each facet so the radiation integral is reduced to a simple summation. This program has proven to be surprisingly robust and useful for the analysis of arbitrary reflectors, particularly when the near-field is desired and surface derivatives are not known. Because of its simplicity, the algorithm has proven to be extremely easy to adapt to the parallel computing architecture of a modest number of large-grain computing elements such as are used in the Intel iPSC and Touchstone Delta parallel machines.
The Contribution of Visualization to Learning Computer Architecture
ERIC Educational Resources Information Center
Yehezkel, Cecile; Ben-Ari, Mordechai; Dreyfus, Tommy
2007-01-01
This paper describes a visualization environment and associated learning activities designed to improve learning of computer architecture. The environment, EasyCPU, displays a model of the components of a computer and the dynamic processes involved in program execution. We present the results of a research program that analysed the contribution of…
Computer Architecture. (Latest Citations from the Aerospace Database)
NASA Technical Reports Server (NTRS)
1996-01-01
The bibliography contains citations concerning research and development in the field of computer architecture. Design of computer systems, microcomputer components, and digital networks are among the topics discussed. Multimicroprocessor system performance, software development, and aerospace avionics applications are also included. (Contains 50-250 citations and includes a subject term index and title list.)
Implementation of the FDK algorithm for cone-beam CT on the cell broadband engine architecture
NASA Astrophysics Data System (ADS)
Scherl, Holger; Koerner, Mario; Hofmann, Hannes; Eckert, Wieland; Kowarschik, Markus; Hornegger, Joachim
2007-03-01
In most of today's commercially available cone-beam CT scanners, the well known FDK method is used for solving the 3D reconstruction task. The computational complexity of this algorithm prohibits its use for many medical applications without hardware acceleration. The brand-new Cell Broadband Engine Architecture (CBEA) with its high level of parallelism is a cost-efficient processor for performing the FDK reconstruction according to the medical requirements. The programming scheme, however, is quite different to any standard personal computer hardware. In this paper, we present an innovative implementation of the most time-consuming parts of the FDK algorithm: filtering and back-projection. We also explain the required transformations to parallelize the algorithm for the CBEA. Our software framework allows to compute the filtering and back-projection in parallel, making it possible to do an on-the-fly-reconstruction. The achieved results demonstrate that a complete FDK reconstruction is computed with the CBEA in less than seven seconds for a standard clinical scenario. Given the fact that scan times are usually much higher, we conclude that reconstruction is finished right after the end of data acquisition. This enables us to present the reconstructed volume to the physician in real-time, immediately after the last projection image has been acquired by the scanning device.
Problems Related to Parallelization of CFD Algorithms on GPU, Multi-GPU and Hybrid Architectures
NASA Astrophysics Data System (ADS)
Biazewicz, Marek; Kurowski, Krzysztof; Ludwiczak, Bogdan; Napieraia, Krystyna
2010-09-01
Computational Fluid Dynamics (CFD) is one of the branches of fluid mechanics, which uses numerical methods and algorithms to solve and analyze fluid flows. CFD is used in various domains, such as oil and gas reservoir uncertainty analysis, aerodynamic body shapes optimization (e.g. planes, cars, ships, sport helmets, skis), natural phenomena analysis, numerical simulation for weather forecasting or realistic visualizations. CFD problem is very complex and needs a lot of computational power to obtain the results in a reasonable time. We have implemented a parallel application for two-dimensional CFD simulation with a free surface approximation (MAC method) using new hardware architectures, in particular multi-GPU and hybrid computing environments. For this purpose we decided to use NVIDIA graphic cards with CUDA environment due to its simplicity of programming and good computations performance. We used finite difference discretization of Navier-Stokes equations, where fluid is propagated over an Eulerian Grid. In this model, the behavior of the fluid inside the cell depends only on the properties of local, surrounding cells, therefore it is well suited for the GPU-based architecture. In this paper we demonstrate how to use efficiently the computing power of GPUs for CFD. Additionally, we present some best practices to help users analyze and improve the performance of CFD applications executed on GPU. Finally, we discuss various challenges around the multi-GPU implementation on the example of matrix multiplication.
Fault tolerant hypercube computer system architecture
NASA Technical Reports Server (NTRS)
Madan, Herb S. (Inventor); Chow, Edward (Inventor)
1989-01-01
A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node
Architectures of Kepler Planet Systems with Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Morehead, Robert C.; Ford, Eric B.
2015-12-01
The distribution of period normalized transit duration ratios among Kepler’s multiple transiting planet systems constrains the distributions of mutual orbital inclinations and orbital eccentricities. However, degeneracies in these parameters tied to the underlying number of planets in these systems complicate their interpretation. To untangle the true architecture of planet systems, the mutual inclination, eccentricity, and underlying planet number distributions must be considered simultaneously. The complexities of target selection, transit probability, detection biases, vetting, and follow-up observations make it impractical to write an explicit likelihood function. Approximate Bayesian computation (ABC) offers an intriguing path forward. In its simplest form, ABC generates a sample of trial population parameters from a prior distribution to produce synthetic datasets via a physically-motivated forward model. Samples are then accepted or rejected based on how close they come to reproducing the actual observed dataset to some tolerance. The accepted samples form a robust and useful approximation of the true posterior distribution of the underlying population parameters. We build on the considerable progress from the field of statistics to develop sequential algorithms for performing ABC in an efficient and flexible manner. We demonstrate the utility of ABC in exoplanet populations and present new constraints on the distributions of mutual orbital inclinations, eccentricities, and the relative number of short-period planets per star. We conclude with a discussion of the implications for other planet occurrence rate calculations, such as eta-Earth.
Architecture independent environment for developing engineering software on MIMD computers
NASA Technical Reports Server (NTRS)
Valimohamed, Karim A.; Lopez, L. A.
1990-01-01
Engineers are constantly faced with solving problems of increasing complexity and detail. Multiple Instruction stream Multiple Data stream (MIMD) computers have been developed to overcome the performance limitations of serial computers. The hardware architectures of MIMD computers vary considerably and are much more sophisticated than serial computers. Developing large scale software for a variety of MIMD computers is difficult and expensive. There is a need to provide tools that facilitate programming these machines. First, the issues that must be considered to develop those tools are examined. The two main areas of concern were architecture independence and data management. Architecture independent software facilitates software portability and improves the longevity and utility of the software product. It provides some form of insurance for the investment of time and effort that goes into developing the software. The management of data is a crucial aspect of solving large engineering problems. It must be considered in light of the new hardware organizations that are available. Second, the functional design and implementation of a software environment that facilitates developing architecture independent software for large engineering applications are described. The topics of discussion include: a description of the model that supports the development of architecture independent software; identifying and exploiting concurrency within the application program; data coherence; engineering data base and memory management.
Computational algorithms for simulations in atmospheric optics.
Konyaev, P A; Lukin, V P
2016-04-20
A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors.
Computational algorithms for simulations in atmospheric optics.
Konyaev, P A; Lukin, V P
2016-04-20
A computer simulation technique for atmospheric and adaptive optics based on parallel programing is discussed. A parallel propagation algorithm is designed and a modified spectral-phase method for computer generation of 2D time-variant random fields is developed. Temporal power spectra of Laguerre-Gaussian beam fluctuations are considered as an example to illustrate the applications discussed. Implementation of the proposed algorithms using Intel MKL and IPP libraries and NVIDIA CUDA technology is shown to be very fast and accurate. The hardware system for the computer simulation is an off-the-shelf desktop with an Intel Core i7-4790K CPU operating at a turbo-speed frequency up to 5 GHz and an NVIDIA GeForce GTX-960 graphics accelerator with 1024 1.5 GHz processors. PMID:27140113
Heavy Lift Vehicle (HLV) Avionics Flight Computing Architecture Study
NASA Technical Reports Server (NTRS)
Hodson, Robert F.; Chen, Yuan; Morgan, Dwayne R.; Butler, A. Marc; Sdhuh, Joseph M.; Petelle, Jennifer K.; Gwaltney, David A.; Coe, Lisa D.; Koelbl, Terry G.; Nguyen, Hai D.
2011-01-01
A NASA multi-Center study team was assembled from LaRC, MSFC, KSC, JSC and WFF to examine potential flight computing architectures for a Heavy Lift Vehicle (HLV) to better understand avionics drivers. The study examined Design Reference Missions (DRMs) and vehicle requirements that could impact the vehicles avionics. The study considered multiple self-checking and voting architectural variants and examined reliability, fault-tolerance, mass, power, and redundancy management impacts. Furthermore, a goal of the study was to develop the skills and tools needed to rapidly assess additional architectures should requirements or assumptions change.
Algorithm-dependent fault tolerance for distributed computing
P. D. Hough; M. e. Goldsby; E. J. Walsh
2000-02-01
Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.
Aerodynamic optimization studies on advanced architecture computers
NASA Technical Reports Server (NTRS)
Chawla, Kalpana
1995-01-01
The approach to carrying out multi-discipline aerospace design studies in the future, especially in massively parallel computing environments, comprises of choosing (1) suitable solvers to compute solutions to equations characterizing a discipline, and (2) efficient optimization methods. In addition, for aerodynamic optimization problems, (3) smart methodologies must be selected to modify the surface shape. In this research effort, a 'direct' optimization method is implemented on the Cray C-90 to improve aerodynamic design. It is coupled with an existing implicit Navier-Stokes solver, OVERFLOW, to compute flow solutions. The optimization method is chosen such that it can accomodate multi-discipline optimization in future computations. In the work , however, only single discipline aerodynamic optimization will be included.
Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures
Datta, Kaushik; Murphy, Mark; Volkov, Vasily; Williams, Samuel; Carter, Jonathan; Oliker, Leonid; Patterson, David; Shalf, John; Yelick, Katherine
2008-08-22
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations -- a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural trade-offs of emerging multicore designs and their implications on scientific algorithm development.
Gálvez, Sergio; Ferusic, Adis; Esteban, Francisco J; Hernández, Pilar; Caballero, Juan A; Dorado, Gabriel
2016-10-01
The Smith-Waterman algorithm has a great sensitivity when used for biological sequence-database searches, but at the expense of high computing-power requirements. To overcome this problem, there are implementations in literature that exploit the different hardware-architectures available in a standard PC, such as GPU, CPU, and coprocessors. We introduce an application that splits the original database-search problem into smaller parts, resolves each of them by executing the most efficient implementations of the Smith-Waterman algorithms in different hardware architectures, and finally unifies the generated results. Using non-overlapping hardware allows simultaneous execution, and up to 2.58-fold performance gain, when compared with any other algorithm to search sequence databases. Even the performance of the popular BLAST heuristic is exceeded in 78% of the tests. The application has been tested with standard hardware: Intel i7-4820K CPU, Intel Xeon Phi 31S1P coprocessors, and nVidia GeForce GTX 960 graphics cards. An important increase in performance has been obtained in a wide range of situations, effectively exploiting the available hardware.
LINCS: Livermore's network architecture. [Octopus computing network
Fletcher, J.G.
1982-01-01
Octopus, a local computing network that has been evolving at the Lawrence Livermore National Laboratory for over fifteen years, is currently undergoing a major revision. The primary purpose of the revision is to consolidate and redefine the variety of conventions and formats, which have grown up over the years, into a single standard family of protocols, the Livermore Interactive Network Communication Standard (LINCS). This standard treats the entire network as a single distributed operating system such that access to a computing resource is obtained in a single way, whether that resource is local (on the same computer as the accessing process) or remote (on another computer). LINCS encompasses not only communication but also such issues as the relationship of customer to server processes and the structure, naming, and protection of resources. The discussion includes: an overview of the Livermore user community and computing hardware, the functions and structure of each of the seven layers of LINCS protocol, the reasons why we have designed our own protocols and why we are dissatisfied by the directions that current protocol standards are taking.
Layered Architectures for Quantum Computers and Quantum Repeaters
NASA Astrophysics Data System (ADS)
Jones, Nathan C.
This chapter examines how to organize quantum computers and repeaters using a systematic framework known as layered architecture, where machine control is organized in layers associated with specialized tasks. The framework is flexible and could be used for analysis and comparison of quantum information systems. To demonstrate the design principles in practice, we develop architectures for quantum computers and quantum repeaters based on optically controlled quantum dots, showing how a myriad of technologies must operate synchronously to achieve fault-tolerance. Optical control makes information processing in this system very fast, scalable to large problem sizes, and extendable to quantum communication.
Architecture and grid application of cluster computing system
NASA Astrophysics Data System (ADS)
Lv, Yi; Yu, Shuiqin; Mao, Youju
2004-11-01
Recently, people pay more attention to the grid technology. It can not only connect all kinds of resources in the network, but also put them into a super transparent computing environment for customers to realize mete-computing which can share computing resources. Traditional parallel computing system, such as SMP(Symmetrical multiprocessor) and MPP(massively parallel processor), use multi-processors to raise computing speed in a close coupling way, so the flexible and scalable performance of the system are limited, as a result of it, the system can't meet the requirement of the grid technology. In this paper, the architecture of cluster computing system applied in grid nodes is introduced. It mainly includes the following aspects. First, the network architecture of cluster computing system in grid nodes is analyzed and designed. Second, how to realize distributing computing (including coordinating computing and sharing computing) of cluster computing system in grid nodes to construct virtual node computers is discussed. Last, communication among grid nodes is analyzed. In other words, it discusses how to realize single reflection to let all the service requirements from customers be met through sending to the grid nodes.
Large computer systems and new architectures addendum
NASA Astrophysics Data System (ADS)
Bloch, T.
1982-05-01
This addendum brings the paper referenced up-to-date as of January 1981. The new market situation in the area of super computers is reviewed and the only really new machine, the CDC CYBER 205, is described. The CRAY-1S series is briefly described and the FPS-164 is mentioned. The fact that Burroughs discontinued the BSP project is mentioned.
Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable
Schuller, Ivan K.; Stevens, Rick; Pino, Robinson; Pechan, Michael
2015-10-29
Computation in its many forms is the engine that fuels our modern civilization. Modern computation—based on the von Neumann architecture—has allowed, until now, the development of continuous improvements, as predicted by Moore’s law. However, computation using current architectures and materials will inevitably—within the next 10 years—reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like (“neuromorphic”) computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS based technology? If so, what are the basic research challenges for materials sicence and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully “neuromorphic” computer. To address this challenge, the following issues were considered: The main differences between neuromorphic and conventional computing as related to: signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning New neuromorphic architectures needed to: produce lower energy consumption, potential novel nanostructured materials, and enhanced computation Device and materials properties needed to implement functions such as: hysteresis, stability, and fault tolerance Comparisons of different implementations: spin torque, memristors, resistive switching, phase change, and optical schemes for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.
Compression Techniques for Improved Algorithm Computational Performance
NASA Technical Reports Server (NTRS)
Zalameda, Joseph N.; Howell, Patricia A.; Winfree, William P.
2005-01-01
Analysis of thermal data requires the processing of large amounts of temporal image data. The processing of the data for quantitative information can be time intensive especially out in the field where large areas are inspected resulting in numerous data sets. By applying a temporal compression technique, improved algorithm performance can be obtained. In this study, analysis techniques are applied to compressed and non-compressed thermal data. A comparison is made based on computational speed and defect signal to noise.
A computationally efficient particle-simulation method suited to vector-computer architectures
McDonald, J.D.
1990-01-01
Recent interest in a National Aero-Space Plane (NASP) and various Aero-assisted Space Transfer Vehicles (ASTVs) presents the need for a greater understanding of high-speed rarefied flight conditions. Particle simulation techniques such as the Direct Simulation Monte Carlo (DSMC) method are well suited to such problems, but the high cost of computation limits the application of the methods to two-dimensional or very simple three-dimensional problems. This research re-examines the algorithmic structure of existing particle simulation methods and re-structures them to allow efficient implementation on vector-oriented supercomputers. A brief overview of the DSMC method and the Cray-2 vector computer architecture are provided, and the elements of the DSMC method that inhibit substantial vectorization are identified. One such element is the collision selection algorithm. A complete reformulation of underlying kinetic theory shows that this may be efficiently vectorized for general gas mixtures. The mechanics of collisions are vectorizable in the DSMC method, but several optimizations are suggested that greatly enhance performance. Also this thesis proposes a new mechanism for the exchange of energy between vibration and other energy modes. The developed scheme makes use of quantized vibrational states and is used in place of the Borgnakke-Larsen model. Finally, a simplified representation of physical space and boundary conditions is utilized to further reduce the computational cost of the developed method. Comparison to solutions obtained from the DSMC method for the relaxation of internal energy modes in a homogeneous gas, as well as single and multiple specie shock wave profiles, are presented. Additionally, a large scale simulation of the flow about the proposed Aeroassisted Flight Experiment (AFE) vehicle is included as an example of the new computational capability of the developed particle simulation method.
Algorithms for parallel flow solvers on message passing architectures
NASA Astrophysics Data System (ADS)
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those
Pipelined CPU Design with FPGA in Teaching Computer Architecture
ERIC Educational Resources Information Center
Lee, Jong Hyuk; Lee, Seung Eun; Yu, Heon Chang; Suh, Taeweon
2012-01-01
This paper presents a pipelined CPU design project with a field programmable gate array (FPGA) system in a computer architecture course. The class project is a five-stage pipelined 32-bit MIPS design with experiments on the Altera DE2 board. For proper scheduling, milestones were set every one or two weeks to help students complete the project on…
Parallel algorithms for optical digital computers
Huang, A.
1983-01-01
Conventional computers suffer from several communication bottlenecks which fundamentally limit their performance. These bottlenecks are characterised by an address-dependent sequential transfer of information which arises from the need to time-multiplex information over a limited number of interconnections. An optical digital computer based on a classical finite state machine can be shown to be free of these bottlenecks. Such a processor would be unique since it would be capable of modifying its entire state space each cycle while conventional computers can only alter a few bits. New algorithms are needed to manage and use this capability. A technique based on recognising a particular symbol in parallel and replacing it in parallel with another symbol is suggested. Examples using this parallel symbolic substitution to perform binary addition and binary incrementation are presented. Applications involving Boolean logic, functional programming languages, production rule driven artificial intelligence, and molecular chemistry are also discussed. 12 references.
VLSI architectures for computing multiplications and inverses in GF(2-m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.; Reed, I. S.
1983-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
VLSI architectures for computing multiplications and inverses in GF(2m)
NASA Technical Reports Server (NTRS)
Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.
1985-01-01
Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
NASA Astrophysics Data System (ADS)
Trigueros-Espinosa, Blas; Vélez-Reyes, Miguel; Santiago-Santiago, Nayda G.; Rosario-Torres, Samuel
2011-06-01
Hyperspectral sensors can collect hundreds of images taken at different narrow and contiguously spaced spectral bands. This high-resolution spectral information can be used to identify materials and objects within the field of view of the sensor by their spectral signature, but this process may be computationally intensive due to the large data sizes generated by the hyperspectral sensors, typically hundreds of megabytes. This can be an important limitation for some applications where the detection process must be performed in real time (surveillance, explosive detection, etc.). In this work, we developed a parallel implementation of three state-ofthe- art target detection algorithms (RX algorithm, matched filter and adaptive matched subspace detector) using a graphics processing unit (GPU) based on the NVIDIA® CUDA™ architecture. In addition, a multi-core CPUbased implementation of each algorithm was developed to be used as a baseline for the speedups estimation. We evaluated the performance of the GPU-based implementations using an NVIDIA ® Tesla® C1060 GPU card, and the detection accuracy of the implemented algorithms was evaluated using a set of phantom images simulating traces of different materials on clothing. We achieved a maximum speedup in the GPU implementations of around 20x over a multicore CPU-based implementation, which suggests that applications for real-time detection of targets in HSI can greatly benefit from the performance of GPUs as processing hardware.
Fast search algorithms for computational protein design.
Traoré, Seydou; Roberts, Kyle E; Allouche, David; Donald, Bruce R; André, Isabelle; Schiex, Thomas; Barbe, Sophie
2016-05-01
One of the main challenges in computational protein design (CPD) is the huge size of the protein sequence and conformational space that has to be computationally explored. Recently, we showed that state-of-the-art combinatorial optimization technologies based on Cost Function Network (CFN) processing allow speeding up provable rigid backbone protein design methods by several orders of magnitudes. Building up on this, we improved and injected CFN technology into the well-established CPD package Osprey to allow all Osprey CPD algorithms to benefit from associated speedups. Because Osprey fundamentally relies on the ability of A* to produce conformations in increasing order of energy, we defined new A* strategies combining CFN lower bounds, with new side-chain positioning-based branching scheme. Beyond the speedups obtained in the new A*-CFN combination, this novel branching scheme enables a much faster enumeration of suboptimal sequences, far beyond what is reachable without it. Together with the immediate and important speedups provided by CFN technology, these developments directly benefit to all the algorithms that previously relied on the DEE/ A* combination inside Osprey* and make it possible to solve larger CPD problems with provable algorithms.
Fast search algorithms for computational protein design.
Traoré, Seydou; Roberts, Kyle E; Allouche, David; Donald, Bruce R; André, Isabelle; Schiex, Thomas; Barbe, Sophie
2016-05-01
One of the main challenges in computational protein design (CPD) is the huge size of the protein sequence and conformational space that has to be computationally explored. Recently, we showed that state-of-the-art combinatorial optimization technologies based on Cost Function Network (CFN) processing allow speeding up provable rigid backbone protein design methods by several orders of magnitudes. Building up on this, we improved and injected CFN technology into the well-established CPD package Osprey to allow all Osprey CPD algorithms to benefit from associated speedups. Because Osprey fundamentally relies on the ability of A* to produce conformations in increasing order of energy, we defined new A* strategies combining CFN lower bounds, with new side-chain positioning-based branching scheme. Beyond the speedups obtained in the new A*-CFN combination, this novel branching scheme enables a much faster enumeration of suboptimal sequences, far beyond what is reachable without it. Together with the immediate and important speedups provided by CFN technology, these developments directly benefit to all the algorithms that previously relied on the DEE/ A* combination inside Osprey* and make it possible to solve larger CPD problems with provable algorithms. PMID:26833706
Computational algorithms to predict Gene Ontology annotations
2015-01-01
Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper
Innovative architectures for dense multi-microprocessor computers
NASA Technical Reports Server (NTRS)
Larson, Robert E.
1989-01-01
The purpose is to summarize a Phase 1 SBIR project performed for the NASA/Langley Computational Structural Mechanics Group. The project was performed from February to August 1987. The main objectives of the project were to: (1) expand upon previous research into the application of chordal ring architectures to the general problem of designing multi-microcomputer architectures, (2) attempt to identify a family of chordal rings such that each chordal ring can be simply expanded to produce the next member of the family, (3) perform a preliminary, high-level design of an expandable multi-microprocessor computer based upon chordal rings, (4) analyze the potential use of chordal ring based multi-microprocessors for sparse matrix problems and other applications arising in computational structural mechanics.
Domain decomposition algorithms and computational fluid dynamics
NASA Technical Reports Server (NTRS)
Chan, Tony F.
1988-01-01
Some of the new domain decomposition algorithms are applied to two model problems in computational fluid dynamics: the two-dimensional convection-diffusion problem and the incompressible driven cavity flow problem. First, a brief introduction to the various approaches of domain decomposition is given, and a survey of domain decomposition preconditioners for the operator on the interface separating the subdomains is then presented. For the convection-diffusion problem, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is examined.
Portable computer system architecture for the Space Station Freedom program
NASA Technical Reports Server (NTRS)
Alena, Richard; Liu, Yuan-Kwei; Fernquist, Alan R.
1993-01-01
This paper outlines various mission requirements and technical approaches that support the potential use of portable computers in several defined activities within the Space Station Freedom (SSF) program. Specifically, the use of portable computers as consoles for both spacecraft control and payload applications is presented. Various issues and proposed solutions regarding the incorporation of portable computers within the program are presented. The primary issues presented regard architecture (standard interface for expansion, advanced processors and displays), integration (methods of high-speed data communication, peripheral interfaces, and interconnectivity within various support networks), and evolution (wireless communications and multimedia data interface methods).
NASA Astrophysics Data System (ADS)
Shi, X.
2015-12-01
As NSF indicated - "Theory and experimentation have for centuries been regarded as two fundamental pillars of science. It is now widely recognized that computational and data-enabled science forms a critical third pillar." Geocomputation is the third pillar of GIScience and geosciences. With the exponential growth of geodata, the challenge of scalable and high performance computing for big data analytics become urgent because many research activities are constrained by the inability of software or tool that even could not complete the computation process. Heterogeneous geodata integration and analytics obviously magnify the complexity and operational time frame. Many large-scale geospatial problems may be not processable at all if the computer system does not have sufficient memory or computational power. Emerging computer architectures, such as Intel's Many Integrated Core (MIC) Architecture and Graphics Processing Unit (GPU), and advanced computing technologies provide promising solutions to employ massive parallelism and hardware resources to achieve scalability and high performance for data intensive computing over large spatiotemporal and social media data. Exploring novel algorithms and deploying the solutions in massively parallel computing environment to achieve the capability for scalable data processing and analytics over large-scale, complex, and heterogeneous geodata with consistent quality and high-performance has been the central theme of our research team in the Department of Geosciences at the University of Arkansas (UARK). New multi-core architectures combined with application accelerators hold the promise to achieve scalability and high performance by exploiting task and data levels of parallelism that are not supported by the conventional computing systems. Such a parallel or distributed computing environment is particularly suitable for large-scale geocomputation over big data as proved by our prior works, while the potential of such advanced
Architecture Framework for Trapped-Ion Quantum Computer based on Performance Simulation Tool
NASA Astrophysics Data System (ADS)
Ahsan, Muhammad
The challenge of building scalable quantum computer lies in striking appropriate balance between designing a reliable system architecture from large number of faulty computational resources and improving the physical quality of system components. The detailed investigation of performance variation with physics of the components and the system architecture requires adequate performance simulation tool. In this thesis we demonstrate a software tool capable of (1) mapping and scheduling the quantum circuit on a realistic quantum hardware architecture with physical resource constraints, (2) evaluating the performance metrics such as the execution time and the success probability of the algorithm execution, and (3) analyzing the constituents of these metrics and visualizing resource utilization to identify system components which crucially define the overall performance. Using this versatile tool, we explore vast design space for modular quantum computer architecture based on trapped ions. We find that while success probability is uniformly determined by the fidelity of physical quantum operation, the execution time is a function of system resources invested at various layers of design hierarchy. At physical level, the number of lasers performing quantum gates, impact the latency of the fault-tolerant circuit blocks execution. When these blocks are used to construct meaningful arithmetic circuit such as quantum adders, the number of ancilla qubits for complicated non-clifford gates and entanglement resources to establish long-distance communication channels, become major performance limiting factors. Next, in order to factorize large integers, these adders are assembled into modular exponentiation circuit comprising bulk of Shor's algorithm. At this stage, the overall scaling of resource-constraint performance with the size of problem, describes the effectiveness of chosen design. By matching the resource investment with the pace of advancement in hardware technology
Fault-tolerant computer architecture based on INMOS transputer processor
NASA Technical Reports Server (NTRS)
Ortiz, Jorge L.
1987-01-01
Redundant processing was used for several years in mission flight systems. In these systems, more than one processor performs the same task at the same time but only one processor is actually in real use. A fault-tolerance computer architecture based on the features provided by INMOS Transputers is presented. The Transputer architecture provides several communication links that allow data and command communication with other Transputers without the use of a bus. Additionally the Transputer allows the use of parallel processing to increase the system speed considerably. The processor architecture consists of three processors working in parallel keeping all the processors at the same operational level but only one processor is in real control of the process. The design allows each Transputer to perform a test to the other two Transputers and report the operating condition of the neighboring processors. A graphic display was developed to facilitate the identification of any problem by the user.
Hybrid parallel computing architecture for multiview phase shifting
NASA Astrophysics Data System (ADS)
Zhong, Kai; Li, Zhongwei; Zhou, Xiaohui; Shi, Yusheng; Wang, Congjun
2014-11-01
The multiview phase-shifting method shows its powerful capability in achieving high resolution three-dimensional (3-D) shape measurement. Unfortunately, this ability results in very high computation costs and 3-D computations have to be processed offline. To realize real-time 3-D shape measurement, a hybrid parallel computing architecture is proposed for multiview phase shifting. In this architecture, the central processing unit can co-operate with the graphic processing unit (GPU) to achieve hybrid parallel computing. The high computation cost procedures, including lens distortion rectification, phase computation, correspondence, and 3-D reconstruction, are implemented in GPU, and a three-layer kernel function model is designed to simultaneously realize coarse-grained and fine-grained paralleling computing. Experimental results verify that the developed system can perform 50 fps (frame per second) real-time 3-D measurement with 260 K 3-D points per frame. A speedup of up to 180 times is obtained for the performance of the proposed technique using a NVIDIA GT560Ti graphics card rather than a sequential C in a 3.4 GHZ Inter Core i7 3770.
OS friendly microprocessor architecture: Hardware level computer security
NASA Astrophysics Data System (ADS)
Jungwirth, Patrick; La Fratta, Patrick
2016-05-01
We present an introduction to the patented OS Friendly Microprocessor Architecture (OSFA) and hardware level computer security. Conventional microprocessors have not tried to balance hardware performance and OS performance at the same time. Conventional microprocessors have depended on the Operating System for computer security and information assurance. The goal of the OS Friendly Architecture is to provide a high performance and secure microprocessor and OS system. We are interested in cyber security, information technology (IT), and SCADA control professionals reviewing the hardware level security features. The OS Friendly Architecture is a switched set of cache memory banks in a pipeline configuration. For light-weight threads, the memory pipeline configuration provides near instantaneous context switching times. The pipelining and parallelism provided by the cache memory pipeline provides for background cache read and write operations while the microprocessor's execution pipeline is running instructions. The cache bank selection controllers provide arbitration to prevent the memory pipeline and microprocessor's execution pipeline from accessing the same cache bank at the same time. This separation allows the cache memory pages to transfer to and from level 1 (L1) caching while the microprocessor pipeline is executing instructions. Computer security operations are implemented in hardware. By extending Unix file permissions bits to each cache memory bank and memory address, the OSFA provides hardware level computer security.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
NASA Astrophysics Data System (ADS)
Bogdanov, P. B.; Gorobets, A. V.; Sukov, S. A.
2013-08-01
The design of efficient algorithms for large-scale gas dynamics computations with hybrid (heterogeneous) computing systems whose high performance relies on massively parallel accelerators is addressed. A high-order accurate finite volume algorithm with polynomial reconstruction on unstructured hybrid meshes is used to compute compressible gas flows in domains of complex geometry. The basic operations of the algorithm are implemented in detail for massively parallel accelerators, including AMD and NVIDIA graphics processing units (GPUs). Major optimization approaches and a computation transfer technique are covered. The underlying programming tool is the Open Computing Language (OpenCL) standard, which performs on accelerators of various architectures, both existing and emerging.
Multi-level Hierarchical Poly Tree computer architectures
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug
1990-01-01
Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
Integration of nanoscale memristor synapses in neuromorphic computing architectures
NASA Astrophysics Data System (ADS)
Indiveri, Giacomo; Linares-Barranco, Bernabé; Legenstein, Robert; Deligeorgis, George; Prodromakis, Themistoklis
2013-09-01
Conventional neuro-computing architectures and artificial neural networks have often been developed with no or loose connections to neuroscience. As a consequence, they have largely ignored key features of biological neural processing systems, such as their extremely low-power consumption features or their ability to carry out robust and efficient computation using massively parallel arrays of limited precision, highly variable, and unreliable components. Recent developments in nano-technologies are making available extremely compact and low power, but also variable and unreliable solid-state devices that can potentially extend the offerings of availing CMOS technologies. In particular, memristors are regarded as a promising solution for modeling key features of biological synapses due to their nanoscale dimensions, their capacity to store multiple bits of information per element and the low energy required to write distinct states. In this paper, we first review the neuro- and neuromorphic computing approaches that can best exploit the properties of memristor and scale devices, and then propose a novel hybrid memristor-CMOS neuromorphic circuit which represents a radical departure from conventional neuro-computing approaches, as it uses memristors to directly emulate the biophysics and temporal dynamics of real synapses. We point out the differences between the use of memristors in conventional neuro-computing architectures and the hybrid memristor-CMOS circuit proposed, and argue how this circuit represents an ideal building block for implementing brain-inspired probabilistic computing paradigms that are robust to variability and fault tolerant by design.
Integration of nanoscale memristor synapses in neuromorphic computing architectures.
Indiveri, Giacomo; Linares-Barranco, Bernabé; Legenstein, Robert; Deligeorgis, George; Prodromakis, Themistoklis
2013-09-27
Conventional neuro-computing architectures and artificial neural networks have often been developed with no or loose connections to neuroscience. As a consequence, they have largely ignored key features of biological neural processing systems, such as their extremely low-power consumption features or their ability to carry out robust and efficient computation using massively parallel arrays of limited precision, highly variable, and unreliable components. Recent developments in nano-technologies are making available extremely compact and low power, but also variable and unreliable solid-state devices that can potentially extend the offerings of availing CMOS technologies. In particular, memristors are regarded as a promising solution for modeling key features of biological synapses due to their nanoscale dimensions, their capacity to store multiple bits of information per element and the low energy required to write distinct states. In this paper, we first review the neuro- and neuromorphic computing approaches that can best exploit the properties of memristor and scale devices, and then propose a novel hybrid memristor-CMOS neuromorphic circuit which represents a radical departure from conventional neuro-computing approaches, as it uses memristors to directly emulate the biophysics and temporal dynamics of real synapses. We point out the differences between the use of memristors in conventional neuro-computing architectures and the hybrid memristor-CMOS circuit proposed, and argue how this circuit represents an ideal building block for implementing brain-inspired probabilistic computing paradigms that are robust to variability and fault tolerant by design.
Hardware architecture for full analytical Fraunhofer computer-generated holograms
NASA Astrophysics Data System (ADS)
Pang, Zhi-Yong; Xu, Zong-Xi; Xiong, Yi; Chen, Biao; Dai, Hui-Min; Jiang, Shao-Ji; Dong, Jian-Wen
2015-09-01
Hardware architecture of parallel computation is proposed for generating Fraunhofer computer-generated holograms (CGHs). A pipeline-based integrated circuit architecture is realized by employing the modified Fraunhofer analytical formulism, which is large scale and enables all components to be concurrently operated. The architecture of the CGH contains five modules to calculate initial parameters of amplitude, amplitude compensation, phases, and phase compensation, respectively. The precalculator of amplitude is fully adopted considering the "reusable design" concept. Each complex operation type (such as square arithmetic) is reused only once by means of a multichannel selector. The implemented hardware calculates an 800×600 pixels hologram in parallel using 39,319 logic elements, 21,074 registers, and 12,651 memory bits in an Altera field-programmable gate array environment with stable operation at 50 MHz. Experimental results demonstrate that the quality of the images reconstructed from the hardware-generated hologram can be comparable to that of a software implementation. Moreover, the calculation speed is approximately 100 times faster than that of a personal computer with an Intel i5-3230M 2.6 GHz CPU for a triangular object.
Thermodynamic cost of computation, algorithmic complexity and the information metric
NASA Technical Reports Server (NTRS)
Zurek, W. H.
1989-01-01
Algorithmic complexity is discussed as a computational counterpart to the second law of thermodynamics. It is shown that algorithmic complexity, which is a measure of randomness, sets limits on the thermodynamic cost of computations and casts a new light on the limitations of Maxwell's demon. Algorithmic complexity can also be used to define distance between binary strings.
Parallel algorithm for computing points on a computation front hyperplane
NASA Astrophysics Data System (ADS)
Krasnov, M. M.
2015-01-01
A parallel algorithm for computing points on a computation front hyperplane is described. This task arises in the computation of a quantity defined on a multidimensional rectangular domain. Three-dimensional domains are usually discussed, but the material is given in the general form when the number of measurements is at least two. When the values of a quantity at different points are internally independent (which is frequently the case), the corresponding computations are independent as well and can be performed in parallel. However, if there are internal dependences (as, for example, in the Gauss-Seidel method for systems of linear equations), then the order of scanning points of the domain is an important issue. A conventional approach in this case is to form a computation front hyperplane (a usual plane in the three-dimensional case and a line in the two-dimensional case) that moves linearly across the domain at a certain angle. At every step in the course of motion of this hyperplane, its intersection points with the domain can be treated independently and, hence, in parallel, but the steps themselves are executed sequentially. At different steps, the intersection of the hyperplane with the entire domain can have a rather complex geometry and the search for all points of the domain lying on the hyperplane at a given step is a nontrivial problem. This problem (i.e., the computation of the coordinates of points lying in the intersection of the domain with the hyperplane at a given step in the course of hyperplane motion) is addressed below. The computations over the points of the hyperplane can be executed in parallel.
Scaling to Nanotechnology Limits with the PIMS Computer Architecture and a new Scaling Rule.
Debenedictis, Erik
2015-02-01
We describe a new approach to computing that moves towards the limits of nanotechnology using a newly formulated sc aling rule. This is in contrast to the current computer industry scali ng away from von Neumann's original computer at the rate of Moore's Law. We extend Moore's Law to 3D, which l eads generally to architectures that integrate logic and memory. To keep pow er dissipation cons tant through a 2D surface of the 3D structure requires using adiabatic principles. We call our newly proposed architecture Processor In Memory and Storage (PIMS). We propose a new computational model that integrates processing and memory into "tiles" that comprise logic, memory/storage, and communications functions. Since the programming model will be relatively stable as a system scales, programs repr esented by tiles could be executed in a PIMS system built with today's technology or could become the "schematic diagram" for implementation in an ultimate 3D nanotechnology of the future. We build a systems software approach that offers advantages over and above the technological and arch itectural advantages. Firs t, the algorithms may be more efficient in the conventional sens e of having fewer steps. Second, the algorithms may run with higher power efficiency per operation by being a better match for the adiabatic scaling ru le. The performance analysis based on demonstrated ideas in physical science suggests 80,000 x improvement in cost per operation for the (arguably) gene ral purpose function of emulating neurons in Deep Learning.
Architectural requirements for the Red Storm computing system.
Camp, William J.; Tomkins, James Lee
2003-10-01
This report is based on the Statement of Work (SOW) describing the various requirements for delivering 3 new supercomputer system to Sandia National Laboratories (Sandia) as part of the Department of Energy's (DOE) Accelerated Strategic Computing Initiative (ASCI) program. This system is named Red Storm and will be a distributed memory, massively parallel processor (MPP) machine built primarily out of commodity parts. The requirements presented here distill extensive architectural and design experience accumulated over a decade and a half of research, development and production operation of similar machines at Sandia. Red Storm will have an unusually high bandwidth, low latency interconnect, specially designed hardware and software reliability features, a light weight kernel compute node operating system and the ability to rapidly switch major sections of the machine between classified and unclassified computing environments. Particular attention has been paid to architectural balance in the design of Red Storm, and it is therefore expected to achieve an atypically high fraction of its peak speed of 41 TeraOPS on real scientific computing applications. In addition, Red Storm is designed to be upgradeable to many times this initial peak capability while still retaining appropriate balance in key design dimensions. Installation of the Red Storm computer system at Sandia's New Mexico site is planned for 2004, and it is expected that the system will be operated for a minimum of five years following installation.
A Component Architecture for High-Performance Scientific Computing
Bernholdt, D E; Allan, B A; Armstrong, R; Bertrand, F; Chiu, K; Dahlgren, T L; Damevski, K; Elwasif, W R; Epperly, T W; Govindaraju, M; Katz, D S; Kohl, J A; Krishnan, M; Kumfert, G; Larson, J W; Lefantzi, S; Lewis, M J; Malony, A D; McInnes, L C; Nieplocha, J; Norris, B; Parker, S G; Ray, J; Shende, S; Windus, T L; Zhou, S
2004-12-14
The Common Component Architecture (CCA) provides a means for software developers to manage the complexity of large-scale scientific simulations and to move toward a plug-and-play environment for high-performance computing. In the scientific computing context, component models also promote collaboration using independently developed software, thereby allowing particular individuals or groups to focus on the aspects of greatest interest to them. The CCA supports parallel and distributed computing as well as local high-performance connections between components in a language-independent manner. The design places minimal requirements on components and thus facilitates the integration of existing code into the CCA environment. The CCA model imposes minimal overhead to minimize the impact on application performance. The focus on high performance distinguishes the CCA from most other component models. The CCA is being applied within an increasing range of disciplines, including combustion research, global climate simulation, and computational chemistry.
Domain decomposition algorithms and computation fluid dynamics
NASA Technical Reports Server (NTRS)
Chan, Tony F.
1988-01-01
In the past several years, domain decomposition was a very popular topic, partly motivated by the potential of parallelization. While a large body of theory and algorithms were developed for model elliptic problems, they are only recently starting to be tested on realistic applications. The application of some of these methods to two model problems in computational fluid dynamics are investigated. Some examples are two dimensional convection-diffusion problems and the incompressible driven cavity flow problem. The construction and analysis of efficient preconditioners for the interface operator to be used in the iterative solution of the interface solution is described. For the convection-diffusion problems, the effect of the convection term and its discretization on the performance of some of the preconditioners is discussed. For the driven cavity problem, the effectiveness of a class of boundary probe preconditioners is discussed.
Algorithmic support for commodity-based parallel computing systems.
Leung, Vitus Joseph; Bender, Michael A.; Bunde, David P.; Phillips, Cynthia Ann
2003-10-01
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
Logeswaran, Rajasvaran; Chen, Li-Choo
2008-12-01
Service architectures are necessary for providing value-added services in telecommunications networks, including those in medical institutions. Separation of service logic and control from the actual call switching is the main idea of these service architectures, examples include Intelligent Network (IN), Telecommunications Information Network Architectures (TINA), and Open Service Access (OSA). In the Distributed Service Architectures (DSA), instances of the same object type can be placed on different physical nodes. Hence, the network performance can be enhanced by introducing load balancing algorithms to efficiently distribute the traffic between object instances, such that the overall throughput and network performance can be optimised. In this paper, we propose a new load balancing algorithm called "Node Status Algorithm" for DSA infrastructure applicable to electronic-based medical institutions. The simulation results illustrate that this proposed algorithm is able to outperform the benchmark load balancing algorithms-Random Algorithm and Shortest Queue Algorithm, especially under medium and heavily loaded network conditions, which are typical of the increasing bandwidth utilization and processing requirements at paperless hospitals and in the telemedicine environment.
Thrifty: An Exascale Architecture for Energy Proportional Computing
Torrellas, Josep
2014-12-23
The objective of this project is to design different aspects of a novel exascale architecture called Thrifty. Our goal is to focus on the challenges of power/energy efficiency, performance, and resiliency in exascale systems. The project includes work on computer architecture (Josep Torrellas from University of Illinois), compilation (Daniel Quinlan from Lawrence Livermore National Laboratory), runtime and applications (Laura Carrington from University of California San Diego), and circuits (Wilfred Pinfold from Intel Corporation). In this report, we focus on the progress at the University of Illinois during the last year of the grant (September 1, 2013 to August 31, 2014). We also point to the progress in the other collaborating institutions when needed.
Integrated command, control, communications and computation system functional architecture
NASA Technical Reports Server (NTRS)
Cooley, C. G.; Gilbert, L. E.
1981-01-01
The functional architecture for an integrated command, control, communications, and computation system applicable to the command and control portion of the NASA End-to-End Data. System is described including the downlink data processing and analysis functions required to support the uplink processes. The functional architecture is composed of four elements: (1) the functional hierarchy which provides the decomposition and allocation of the command and control functions to the system elements; (2) the key system features which summarize the major system capabilities; (3) the operational activity threads which illustrate the interrelationahip between the system elements; and (4) the interfaces which illustrate those elements that originate or generate data and those elements that use the data. The interfaces also provide a description of the data and the data utilization and access techniques.
The computational structural mechanics testbed architecture. Volume 1: The language
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the first set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP, and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 1 presents the basic elements of the CLAMP language and is intended for all users.
The computational structural mechanics testbed architecture. Volume 2: Directives
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the second of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 2 describes the CLIP directives in detail. It is intended for intermediate and advanced users.
The computational structural mechanics testbed architecture. Volume 2: The interface
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the third set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 3 describes the CLIP-Processor interface and related topics. It is intended only for processor developers.
NASA Technical Reports Server (NTRS)
Rutishauser, David K.
2006-01-01
The motivation for this work comes from an observation that amidst the push for Massively Parallel (MP) solutions to high-end computing problems such as numerical physical simulations, large amounts of legacy code exist that are highly optimized for vector supercomputers. Because re-hosting legacy code often requires a complete re-write of the original code, which can be a very long and expensive effort, this work examines the potential to exploit reconfigurable computing machines in place of a vector supercomputer to implement an essentially unmodified legacy source code. Custom and reconfigurable computing resources could be used to emulate an original application's target platform to the extent required to achieve high performance. To arrive at an architecture that delivers the desired performance subject to limited resources involves solving a multi-variable optimization problem with constraints. Prior research in the area of reconfigurable computing has demonstrated that designing an optimum hardware implementation of a given application under hardware resource constraints is an NP-complete problem. The premise of the approach is that the general issue of applying reconfigurable computing resources to the implementation of an application, maximizing the performance of the computation subject to physical resource constraints, can be made a tractable problem by assuming a computational paradigm, such as vector processing. This research contributes a formulation of the problem and a methodology to design a reconfigurable vector processing implementation of a given application that satisfies a performance metric. A generic, parametric, architectural framework for vector processing implemented in reconfigurable logic is developed as a target for a scheduling/mapping algorithm that maps an input computation to a given instance of the architecture. This algorithm is integrated with an optimization framework to arrive at a specification of the architecture parameters
The Snowcloud System: Architecture and Algorithms for Snow Hydrology Studies
NASA Astrophysics Data System (ADS)
Skalka, C.; Brown, I.; Frolik, J.
2013-12-01
Snowcloud is an embedded data collection system for snow hydrology field research campaigns conducted in harsh climates and remote areas. The system combines distributed wireless sensor network technology and computational techniques to provide data at lower cost and higher spatio-temporal resolution than ground-based systems using traditional methods. Snowcloud has seen multiple Winter deployments in settings ranging from high desert to arctic, resulting in over a dozen node-years of practical experience. The Snowcloud system architecture consists of multiple TinyOS mesh-networked sensor stations collecting environmental data above and, in some deployments, below the snowpack. Monitored data modalities include snow depth, ground and air temperature, PAR and leaf-area index (LAI), and soil moisture. To enable power cycling and control of multiple sensors a custom power and sensor conditioning board was developed. The electronics and structural systems for individual stations have been designed and tested (in the lab and in situ) for ease of assembly and robustness to harsh winter conditions. Battery systems and solar chargers enable seasonal operation even under low/no light arctic conditions. Station costs range between 500 and 1000 depending on the instrumentation suite. For remote field locations, a custom designed hand-held device and data retrieval protocol serves as the primary data collection method. We are also developing and testing a Gateway device that will report data in near-real-time (NRT) over a cellular connection. Data is made available to users via web interfaces that also provide basic data analysis and visualization tools. For applications to snow hydrology studies, the better spatiotemporal resolution of snowpack data provided by Snowcloud is beneficial in several aspects. It provides insight into snowpack evolution, and allows us to investigate differences across different spatial and temporal scales in deployment areas. It enables the
The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures
Spafford, Kyle L; Meredith, Jeremy S; Lee, Seyong; Li, Dong; Roth, Philip C; Vetter, Jeffrey S
2012-01-01
With the rise of general purpose computing on graphics processing units (GPGPU), the influence from consumer markets can now be seen across the spectrum of computer architectures. In fact, many of the high-ranking Top500 HPC systems now include these accelerators. Traditionally, GPUs have connected to the CPU via the PCIe bus, which has proved to be a significant bottleneck for scalable scientific applications. Now, a trend toward tighter integration between CPU and GPU has removed this bottleneck and unified the memory hierarchy for both CPU and GPU cores. We examine the impact of this trend for high performance scientific computing by investigating AMD's new Fusion Accelerated Processing Unit (APU) as a testbed. In particular, we evaluate the tradeoffs in performance, power consumption, and programmability when comparing this unified memory hierarchy with similar, but discrete GPUs.
Evaluation of leading scalar and vector architectures for scientific computations
Simon, Horst D.; Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Ethier, Stephane; Shalf, John
2004-04-20
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to reduce this gap for many computational science codes and deliver a substantial increase in computing capabilities. This project examines the performance of the cacheless vector Earth Simulator (ES) and compares it to superscalar cache-based IBM Power3 system. Results demonstrate that the ES is significantly faster than the Power3 architecture, highlighting the tremendous potential advantage of the ES for numerical simulation. However, vectorization of a particle-in-cell application (GTC) greatly increased the memory footprint preventing loop-level parallelism and limiting scalability potential.
Biomorphic Multi-Agent Architecture for Persistent Computing
NASA Technical Reports Server (NTRS)
Lodding, Kenneth N.; Brewster, Paul
2009-01-01
A multi-agent software/hardware architecture, inspired by the multicellular nature of living organisms, has been proposed as the basis of design of a robust, reliable, persistent computing system. Just as a multicellular organism can adapt to changing environmental conditions and can survive despite the failure of individual cells, a multi-agent computing system, as envisioned, could adapt to changing hardware, software, and environmental conditions. In particular, the computing system could continue to function (perhaps at a reduced but still reasonable level of performance) if one or more component( s) of the system were to fail. One of the defining characteristics of a multicellular organism is unity of purpose. In biology, the purpose is survival of the organism. The purpose of the proposed multi-agent architecture is to provide a persistent computing environment in harsh conditions in which repair is difficult or impossible. A multi-agent, organism-like computing system would be a single entity built from agents or cells. Each agent or cell would be a discrete hardware processing unit that would include a data processor with local memory, an internal clock, and a suite of communication equipment capable of both local line-of-sight communications and global broadcast communications. Some cells, denoted specialist cells, could contain such additional hardware as sensors and emitters. Each cell would be independent in the sense that there would be no global clock, no global (shared) memory, no pre-assigned cell identifiers, no pre-defined network topology, and no centralized brain or control structure. Like each cell in a living organism, each agent or cell of the computing system would contain a full description of the system encoded as genes, but in this case, the genes would be components of a software genome.
A parallel Jacobson-Oksman optimization algorithm. [parallel processing (computers)
NASA Technical Reports Server (NTRS)
Straeter, T. A.; Markos, A. T.
1975-01-01
A gradient-dependent optimization technique which exploits the vector-streaming or parallel-computing capabilities of some modern computers is presented. The algorithm, derived by assuming that the function to be minimized is homogeneous, is a modification of the Jacobson-Oksman serial minimization method. In addition to describing the algorithm, conditions insuring the convergence of the iterates of the algorithm and the results of numerical experiments on a group of sample test functions are presented. The results of these experiments indicate that this algorithm will solve optimization problems in less computing time than conventional serial methods on machines having vector-streaming or parallel-computing capabilities.
Reveal, A General Reverse Engineering Algorithm for Inference of Genetic Network Architectures
NASA Technical Reports Server (NTRS)
Liang, Shoudan; Fuhrman, Stefanie; Somogyi, Roland
1998-01-01
Given the immanent gene expression mapping covering whole genomes during development, health and disease, we seek computational methods to maximize functional inference from such large data sets. Is it possible, in principle, to completely infer a complex regulatory network architecture from input/output patterns of its variables? We investigated this possibility using binary models of genetic networks. Trajectories, or state transition tables of Boolean nets, resemble time series of gene expression. By systematically analyzing the mutual information between input states and output states, one is able to infer the sets of input elements controlling each element or gene in the network. This process is unequivocal and exact for complete state transition tables. We implemented this REVerse Engineering ALgorithm (REVEAL) in a C program, and found the problem to be tractable within the conditions tested so far. For n = 50 (elements) and k = 3 (inputs per element), the analysis of incomplete state transition tables (100 state transition pairs out of a possible 10(exp 15)) reliably produced the original rule and wiring sets. While this study is limited to synchronous Boolean networks, the algorithm is generalizable to include multi-state models, essentially allowing direct application to realistic biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields.
Reveal, a general reverse engineering algorithm for inference of genetic network architectures.
Liang, S; Fuhrman, S; Somogyi, R
1998-01-01
Given the immanent gene expression mapping covering whole genomes during development, health and disease, we seek computational methods to maximize functional inference from such large data sets. Is it possible, in principle, to completely infer a complex regulatory network architecture from input/output patterns of its variables? We investigated this possibility using binary models of genetic networks. Trajectories, or state transition tables of Boolean nets, resemble time series of gene expression. By systematically analyzing the mutual information between input states and output states, one is able to infer the sets of input elements controlling each element or gene in the network. This process is unequivocal and exact for complete state transition tables. We implemented this REVerse Engineering ALgorithm (REVEAL) in a C program, and found the problem to be tractable within the conditions tested so far. For n = 50 (elements) and k = 3 (inputs per element), the analysis of incomplete state transition tables (100 state transition pairs out of a possible 10(15)) reliably produced the original rule and wiring sets. While this study is limited to synchronous Boolean networks, the algorithm is generalizable to include multi-state models, essentially allowing direct application to realistic biological data sets. The ability to adequately solve the inverse problem may enable in-depth analysis of complex dynamic systems in biology and other fields.
Evaluation of cache-based superscalar and cacheless vector architectures for scientific computations
Oliker, Leonid; Canning, Andrew; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; Van der Wijngaart, Rob
2003-05-01
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly out performs the cache-based architectures. However, certain applications are not easily amenable to vectorization and would re quire extensive algorithm and implementation reengineering to utilize the SX-6 effectively.
Evaluation of Cache-based Superscalar and Cacheless Vector Architectures for Scientific Computations
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Carter, Jonathan; Shalf, John; Skinner, David; Ethier, Stephane; Biswas, Rupak; Djomehri, Jahed; VanderWijngaart, Rob
2003-01-01
The growing gap between sustained and peak performance for scientific applications has become a well-known problem in high performance computing. The recent development of parallel vector systems offers the potential to bridge this gap for a significant number of computational science codes and deliver a substantial increase in computing capabilities. This paper examines the intranode performance of the NEC SX6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of key scientific computing areas. First, we present the performance of a microbenchmark suite that examines a full spectrum of low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks using some simple optimizations. Finally, we evaluate the perfor- mance of several numerical codes from key scientific computing domains. Overall results demonstrate that the SX6 achieves high performance on a large fraction of our application suite and in many cases significantly outperforms the RISC-based architectures. However, certain classes of applications are not easily amenable to vectorization and would likely require extensive reengineering of both algorithm and implementation to utilize the SX6 effectively.
A Component Architecture for High-Performance Computing
Bernholdt, D E; Elwasif, W R; Kohl, J A; Epperly, T G W
2003-01-21
The Common Component Architecture (CCA) provides a means for developers to manage the complexity of large-scale scientific software systems and to move toward a ''plug and play'' environment for high-performance computing. The CCA model allows for a direct connection between components within the same process to maintain performance on inter-component calls. It is neutral with respect to parallelism, allowing components to use whatever means they desire to communicate within their parallel ''cohort.'' We will discuss in detail the importance of performance in the design of the CCA and will analyze the performance costs associated with features of the CCA.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Supporting Undergraduate Computer Architecture Students Using a Visual MIPS64 CPU Simulator
ERIC Educational Resources Information Center
Patti, D.; Spadaccini, A.; Palesi, M.; Fazzino, F.; Catania, V.
2012-01-01
The topics of computer architecture are always taught using an Assembly dialect as an example. The most commonly used textbooks in this field use the MIPS64 Instruction Set Architecture (ISA) to help students in learning the fundamentals of computer architecture because of its orthogonality and its suitability for real-world applications. This…
NASA Technical Reports Server (NTRS)
Rickard, D. A.; Bodenheimer, R. E.
1976-01-01
Digital computer components which perform two dimensional array logic operations (Tse logic) on binary data arrays are described. The properties of Golay transforms which make them useful in image processing are reviewed, and several architectures for Golay transform processors are presented with emphasis on the skeletonizing algorithm. Conventional logic control units developed for the Golay transform processors are described. One is a unique microprogrammable control unit that uses a microprocessor to control the Tse computer. The remaining control units are based on programmable logic arrays. Performance criteria are established and utilized to compare the various Golay transform machines developed. A critique of Tse logic is presented, and recommendations for additional research are included.
PIC codes for plasma accelerators on emerging computer architectures (GPUS, Multicore/Manycore CPUS)
NASA Astrophysics Data System (ADS)
Vincenti, Henri
2016-03-01
The advent of exascale computers will enable 3D simulations of a new laser-plasma interaction regimes that were previously out of reach of current Petasale computers. However, the paradigm used to write current PIC codes will have to change in order to fully exploit the potentialities of these new computing architectures. Indeed, achieving Exascale computing facilities in the next decade will be a great challenge in terms of energy consumption and will imply hardware developments directly impacting our way of implementing PIC codes. As data movement (from die to network) is by far the most energy consuming part of an algorithm future computers will tend to increase memory locality at the hardware level and reduce energy consumption related to data movement by using more and more cores on each compute nodes (''fat nodes'') that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, CPU machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. GPU's also have a reduced clock speed per core and can process Multiple Instructions on Multiple Datas (MIMD). At the software level Particle-In-Cell (PIC) codes will thus have to achieve both good memory locality and vectorization (for Multicore/Manycore CPU) to fully take advantage of these upcoming architectures. In this talk, we present the portable solutions we implemented in our high performance skeleton PIC code PICSAR to both achieve good memory locality and cache reuse as well as good vectorization on SIMD architectures. We also present the portable solutions used to parallelize the Pseudo-sepctral quasi-cylindrical code FBPIC on GPUs using the Numba python compiler.
Analysis of multigrid methods on massively parallel computers: Architectural implications
NASA Technical Reports Server (NTRS)
Matheson, Lesley R.; Tarjan, Robert E.
1993-01-01
We study the potential performance of multigrid algorithms running on massively parallel computers with the intent of discovering whether presently envisioned machines will provide an efficient platform for such algorithms. We consider the domain parallel version of the standard V cycle algorithm on model problems, discretized using finite difference techniques in two and three dimensions on block structured grids of size 10(exp 6) and 10(exp 9), respectively. Our models of parallel computation were developed to reflect the computing characteristics of the current generation of massively parallel multicomputers. These models are based on an interconnection network of 256 to 16,384 message passing, 'workstation size' processors executing in an SPMD mode. The first model accomplishes interprocessor communications through a multistage permutation network. The communication cost is a logarithmic function which is similar to the costs in a variety of different topologies. The second model allows single stage communication costs only. Both models were designed with information provided by machine developers and utilize implementation derived parameters. With the medium grain parallelism of the current generation and the high fixed cost of an interprocessor communication, our analysis suggests an efficient implementation requires the machine to support the efficient transmission of long messages, (up to 1000 words) or the high initiation cost of a communication must be significantly reduced through an alternative optimization technique. Furthermore, with variable length message capability, our analysis suggests the low diameter multistage networks provide little or no advantage over a simple single stage communications network.
Low-cost space-varying FIR filter architecture for computational imaging systems
NASA Astrophysics Data System (ADS)
Feng, Guotong; Shoaib, Mohammed; Schwartz, Edward L.; Dirk Robinson, M.
2010-01-01
Recent research demonstrates the advantage of designing electro-optical imaging systems by jointly optimizing the optical and digital subsystems. The optical systems designed using this joint approach intentionally introduce large and often space-varying optical aberrations that produce blurry optical images. Digital sharpening restores reduced contrast due to these intentional optical aberrations. Computational imaging systems designed in this fashion have several advantages including extended depth-of-field, lower system costs, and improved low-light performance. Currently, most consumer imaging systems lack the necessary computational resources to compensate for these optical systems with large aberrations in the digital processor. Hence, the exploitation of the advantages of the jointly designed computational imaging system requires low-complexity algorithms enabling space-varying sharpening. In this paper, we describe a low-cost algorithmic framework and associated hardware enabling the space-varying finite impulse response (FIR) sharpening required to restore largely aberrated optical images. Our framework leverages the space-varying properties of optical images formed using rotationally-symmetric optical lens elements. First, we describe an approach to leverage the rotational symmetry of the point spread function (PSF) about the optical axis allowing computational savings. Second, we employ a specially designed bank of sharpening filters tuned to the specific radial variation common to optical aberrations. We evaluate the computational efficiency and image quality achieved by using this low-cost space-varying FIR filter architecture.
Computational architecture for image processing on a small unmanned ground vehicle
NASA Astrophysics Data System (ADS)
Ho, Sean; Nguyen, Hung
2010-08-01
Man-portable Unmanned Ground Vehicles (UGVs) have been fielded on the battlefield with limited computing power. This limitation constrains their use primarily to teleoperation control mode for clearing areas and bomb defusing. In order to extend their capability to include the reconnaissance and surveillance missions of dismounted soldiers, a separate processing payload is desired. This paper presents a processing architecture and the design details on the payload module that enables the PackBot to perform sophisticated, real-time image processing algorithms using data collected from its onboard imaging sensors including LADAR, IMU, visible, IR, stereo, and the Ladybug spherical cameras. The entire payload is constructed from currently available Commercial off-the-shelf (COTS) components including an Intel multi-core CPU and a Nvidia GPU. The result of this work enables a small UGV to perform computationally expensive image processing tasks that once were only feasible on a large workstation.
Examining the architecture of cellular computing through a comparative study with a computer.
Wang, Degeng; Gribskov, Michael
2005-06-22
The computer and the cell both use information embedded in simple coding, the binary software code and the quadruple genomic code, respectively, to support system operations. A comparative examination of their system architecture as well as their information storage and utilization schemes is performed. On top of the code, both systems display a modular, multi-layered architecture, which, in the case of a computer, arises from human engineering efforts through a combination of hardware implementation and software abstraction. Using the computer as a reference system, a simplistic mapping of the architectural components between the two is easily detected. This comparison also reveals that a cell abolishes the software-hardware barrier through genomic encoding for the constituents of the biochemical network, a cell's "hardware" equivalent to the computer central processing unit (CPU). The information loading (gene expression) process acts as a major determinant of the encoded constituent's abundance, which, in turn, often determines the "bandwidth" of a biochemical pathway. Cellular processes are implemented in biochemical pathways in parallel manners. In a computer, on the other hand, the software provides only instructions and data for the CPU. A process represents just sequentially ordered actions by the CPU and only virtual parallelism can be implemented through CPU time-sharing. Whereas process management in a computer may simply mean job scheduling, coordinating pathway bandwidth through the gene expression machinery represents a major process management scheme in a cell. In summary, a cell can be viewed as a super-parallel computer, which computes through controlled hardware composition. While we have, at best, a very fragmented understanding of cellular operation, we have a thorough understanding of the computer throughout the engineering process. The potential utilization of this knowledge to the benefit of systems biology is discussed. PMID:16849179
An architecture for quantum computation with magnetically trapped Holmium atoms
NASA Astrophysics Data System (ADS)
Saffman, Mark; Hostetter, James; Booth, Donald; Collett, Jeffrey
2016-05-01
Outstanding challenges for scalable neutral atom quantum computation include correction of atom loss due to collisions with untrapped background gas, reduction of crosstalk during state preparation and measurement due to scattering of near resonant light, and the need to improve quantum gate fidelity. We present a scalable architecture based on loading single Holmium atoms into an array of Ioffe-Pritchard traps. The traps are formed by grids of superconducting wires giving a trap array with 40 μm period, suitable for entanglement via long range Rydberg gates. The states | F = 5 , M = 5 > and | F = 7 , M = 7 > provide a magic trapping condition at a low field of 3.5 G for long coherence time qubit encoding. The F = 11 level will be used for state preparation and measurement. The availability of different states for encoding, gate operations, and measurement, spectroscopically isolates the different operations and will prevent crosstalk to neighboring qubits. Operation in a cryogenic environment with ultra low pressure will increase atom lifetime and Rydberg gate fidelity by reduction of blackbody induced Rydberg decay. We will present a complete description of the architecture including estimates of achievable performance metrics. Work supported by NSF award PHY-1404357.
Chakrabarti, C. . Dept. of Electrical Engineering); Ja Ja, J. . Dept. of Electrical Engineering)
1990-11-01
This paper proposes two-dimensional systolic array implementations for computing the discrete Hartley (DHT) and the discrete cosine transforms (DCT) when the transform size N is decomposable into mutually prime factors. The existing two-dimensional formulations for DHT and DCT are modified and the corresponding algorithms are mapped into two-dimensional systolic arrays. The resulting architecture is fully pipelined with no control units. The hardware design is based on bit serial left to right MSB to LSB binary arithmetic.
Architecture-Aware Algorithms for Scalable Performance and Resilience on Heterogeneous Architectures
Dongarra, Jack
2013-03-14
There is a widening gap between the peak performance of high performance computers and the performance realized by full applications. Over the next decade, extreme-scale systems will present major new challenges to software development that could widen the gap so much that it prevents the productive use of future DOE Leadership computers.
Hardware Architectures for Data-Intensive Computing Problems: A Case Study for String Matching
Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel
2012-12-28
DNA analysis is an emerging application of high performance bioinformatic. Modern sequencing machinery are able to provide, in few hours, large input streams of data, which needs to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and fastly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. High performance systems are a promising platform to accelerate this algorithm, which is computationally intensive but also inherently parallel. Nowadays, high performance systems also include heterogeneous processing elements, such as Graphic Processing Units (GPUs), to further accelerate parallel algorithms. Unfortunately, the Aho-Corasick algorithm exhibits large performance variability, depending on the size of the input streams, on the number of patterns to search and on the number of matches, and poses significant challenges on current high performance software and hardware implementations. An adequate mapping of the algorithm on the target architecture, coping with the limit of the underlining hardware, is required to reach the desired high throughputs. In this paper, we discuss the implementation of the Aho-Corasick algorithm for GPU-accelerated high performance systems. We present an optimized implementation of Aho-Corasick for GPUs and discuss its tradeoffs on the Tesla T10 and he new Tesla T20 (codename Fermi) GPUs. We then integrate the optimized GPU code, respectively, in a MPI-based and in a pthreads-based load balancer to enable execution of the algorithm on clusters and large sharedmemory multiprocessors (SMPs) accelerated with multiple GPUs.
Real-time image analysis using wavelets: the "a trous" algorithm on MIMD architectures
NASA Astrophysics Data System (ADS)
Feil, Manfred; Uhl, Andreas
1999-03-01
The 'a trous' algorithm represents a discrete approach to the classical continuous wavelet transform. Similar to the fast wavelet transform the input signal is analyzed by using the coefficients of a properly chosen low-pass filter, but in contradistinction to the latter there follows no concluding decimation step. Examples of practical applications can be found in the field of cosmology for studying the formation of large scale structures of the Universe. In this paper we develop parallel algorithms on different MIMD architectures for the 2D 'a trous' decomposition. We implement the algorithm on several distributed memory architectures using the PVM paradigm and on a SGI POWERChallenge using a parallel version of the C programming language. Finally we investigate experimental results obtained on both of them.
The Use of Alternative Algorithms in Whole Number Computation
ERIC Educational Resources Information Center
Norton, Stephen
2012-01-01
Pedagogical reform in Australia over the last few decades has resulted in a reduced emphasis on the teaching of computational algorithms and a diversity of alternative mechanisms to teach students whole number computations. The effect of these changes upon student recording of whole number computations has had little empirical investigation. As…
A novel bit-quad-based Euler number computing algorithm.
Yao, Bin; He, Lifeng; Kang, Shiying; Chao, Yuyan; Zhao, Xiao
2015-01-01
The Euler number of a binary image is an important topological property in computer vision and pattern recognition. This paper proposes a novel bit-quad-based Euler number computing algorithm. Based on graph theory and analysis on bit-quad patterns, our algorithm only needs to count two bit-quad patterns. Moreover, by use of the information obtained during processing the previous bit-quad, the average number of pixels to be checked for processing a bit-quad is only 1.75. Experimental results demonstrated that our method outperforms significantly conventional Euler number computing algorithms. PMID:26636023
A novel bit-quad-based Euler number computing algorithm.
Yao, Bin; He, Lifeng; Kang, Shiying; Chao, Yuyan; Zhao, Xiao
2015-01-01
The Euler number of a binary image is an important topological property in computer vision and pattern recognition. This paper proposes a novel bit-quad-based Euler number computing algorithm. Based on graph theory and analysis on bit-quad patterns, our algorithm only needs to count two bit-quad patterns. Moreover, by use of the information obtained during processing the previous bit-quad, the average number of pixels to be checked for processing a bit-quad is only 1.75. Experimental results demonstrated that our method outperforms significantly conventional Euler number computing algorithms.
NASA Astrophysics Data System (ADS)
Tramm, John R.; Gunow, Geoffrey; He, Tim; Smith, Kord S.; Forget, Benoit; Siegel, Andrew R.
2016-05-01
In this study we present and analyze a formulation of the 3D Method of Characteristics (MOC) technique applied to the simulation of full core nuclear reactors. Key features of the algorithm include a task-based parallelism model that allows independent MOC tracks to be assigned to threads dynamically, ensuring load balancing, and a wide vectorizable inner loop that takes advantage of modern SIMD computer architectures. The algorithm is implemented in a set of highly optimized proxy applications in order to investigate its performance characteristics on CPU, GPU, and Intel Xeon Phi architectures. Speed, power, and hardware cost efficiencies are compared. Additionally, performance bottlenecks are identified for each architecture in order to determine the prospects for continued scalability of the algorithm on next generation HPC architectures.
A novel clustering algorithm inspired by membrane computing.
Peng, Hong; Luo, Xiaohui; Gao, Zhisheng; Wang, Jun; Pei, Zheng
2015-01-01
P systems are a class of distributed parallel computing models; this paper presents a novel clustering algorithm, which is inspired from mechanism of a tissue-like P system with a loop structure of cells, called membrane clustering algorithm. The objects of the cells express the candidate centers of clusters and are evolved by the evolution rules. Based on the loop membrane structure, the communication rules realize a local neighborhood topology, which helps the coevolution of the objects and improves the diversity of objects in the system. The tissue-like P system can effectively search for the optimal partitioning with the help of its parallel computing advantage. The proposed clustering algorithm is evaluated on four artificial data sets and six real-life data sets. Experimental results show that the proposed clustering algorithm is superior or competitive to k-means algorithm and several evolutionary clustering algorithms recently reported in the literature.
A Novel Clustering Algorithm Inspired by Membrane Computing
Luo, Xiaohui; Gao, Zhisheng; Wang, Jun; Pei, Zheng
2015-01-01
P systems are a class of distributed parallel computing models; this paper presents a novel clustering algorithm, which is inspired from mechanism of a tissue-like P system with a loop structure of cells, called membrane clustering algorithm. The objects of the cells express the candidate centers of clusters and are evolved by the evolution rules. Based on the loop membrane structure, the communication rules realize a local neighborhood topology, which helps the coevolution of the objects and improves the diversity of objects in the system. The tissue-like P system can effectively search for the optimal partitioning with the help of its parallel computing advantage. The proposed clustering algorithm is evaluated on four artificial data sets and six real-life data sets. Experimental results show that the proposed clustering algorithm is superior or competitive to k-means algorithm and several evolutionary clustering algorithms recently reported in the literature. PMID:25874264
E-Governance and Service Oriented Computing Architecture Model
NASA Astrophysics Data System (ADS)
Tejasvee, Sanjay; Sarangdevot, S. S.
2010-11-01
E-Governance is the effective application of information communication and technology (ICT) in the government processes to accomplish safe and reliable information lifecycle management. Lifecycle of the information involves various processes as capturing, preserving, manipulating and delivering information. E-Governance is meant to transform of governance in better manner to the citizens which is transparent, reliable, participatory, and accountable in point of view. The purpose of this paper is to attempt e-governance model, focus on the Service Oriented Computing Architecture (SOCA) that includes combination of information and services provided by the government, innovation, find out the way of optimal service delivery to citizens and implementation in transparent and liable practice. This paper also try to enhance focus on the E-government Service Manager as a essential or key factors service oriented and computing model that provides a dynamically extensible structural design in which all area or branch can bring in innovative services. The heart of this paper examine is an intangible model that enables E-government communication for trade and business, citizen and government and autonomous bodies.
NASA Technical Reports Server (NTRS)
Gentzsch, W.
1982-01-01
Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.
A fast algorithm for sparse matrix computations related to inversion
NASA Astrophysics Data System (ADS)
Li, S.; Wu, W.; Darve, E.
2013-06-01
We have developed a fast algorithm for computing certain entries of the inverse of a sparse matrix. Such computations are critical to many applications, such as the calculation of non-equilibrium Green's functions Gr and G< for nano-devices. The FIND (Fast Inverse using Nested Dissection) algorithm is optimal in the big-O sense. However, in practice, FIND suffers from two problems due to the width-2 separators used by its partitioning scheme. One problem is the presence of a large constant factor in the computational cost of FIND. The other problem is that the partitioning scheme used by FIND is incompatible with most existing partitioning methods and libraries for nested dissection, which all use width-1 separators. Our new algorithm resolves these problems by thoroughly decomposing the computation process such that width-1 separators can be used, resulting in a significant speedup over FIND for realistic devices — up to twelve-fold in simulation. The new algorithm also has the added advantage that desired off-diagonal entries can be computed for free. Consequently, our algorithm is faster than the current state-of-the-art recursive methods for meshes of any size. Furthermore, the framework used in the analysis of our algorithm is the first attempt to explicitly apply the widely-used relationship between mesh nodes and matrix computations to the problem of multiple eliminations with reuse of intermediate results. This framework makes our algorithm easier to generalize, and also easier to compare against other methods related to elimination trees. Finally, our accuracy analysis shows that the algorithms that require back-substitution are subject to significant extra round-off errors, which become extremely large even for some well-conditioned matrices or matrices with only moderately large condition numbers. When compared to these back-substitution algorithms, our algorithm is generally a few orders of magnitude more accurate, and our produced round-off errors
Efficient Homotopy Continuation Algorithms with Application to Computational Fluid Dynamics
NASA Astrophysics Data System (ADS)
Brown, David A.
New homotopy continuation algorithms are developed and applied to a parallel implicit finite-difference Newton-Krylov-Schur external aerodynamic flow solver for the compressible Euler, Navier-Stokes, and Reynolds-averaged Navier-Stokes equations with the Spalart-Allmaras one-equation turbulence model. Many new analysis tools, calculations, and numerical algorithms are presented for the study and design of efficient and robust homotopy continuation algorithms applicable to solving very large and sparse nonlinear systems of equations. Several specific homotopies are presented and studied and a methodology is presented for assessing the suitability of specific homotopies for homotopy continuation. . A new class of homotopy continuation algorithms, referred to as monolithic homotopy continuation algorithms, is developed. These algorithms differ from classical predictor-corrector algorithms by combining the predictor and corrector stages into a single update, significantly reducing the amount of computation and avoiding wasted computational effort resulting from over-solving in the corrector phase. The new algorithms are also simpler from a user perspective, with fewer input parameters, which also improves the user's ability to choose effective parameters on the first flow solve attempt. Conditional convergence is proved analytically and studied numerically for the new algorithms. The performance of a fully-implicit monolithic homotopy continuation algorithm is evaluated for several inviscid, laminar, and turbulent flows over NACA 0012 airfoils and ONERA M6 wings. The monolithic algorithm is demonstrated to be more efficient than the predictor-corrector algorithm for all applications investigated. It is also demonstrated to be more efficient than the widely-used pseudo-transient continuation algorithm for all inviscid and laminar cases investigated, and good performance scaling with grid refinement is demonstrated for the inviscid cases. Performance is also demonstrated
AES-128 Bit Algorithm Using Fully Pipelined Architecture for Secret Communication
NASA Astrophysics Data System (ADS)
Gnanambika, M.; Adilakshmi, S.; Noorbasha, Fazal
2012-03-01
In this paper, an efficient method for high speed hardware implementation of AES algorithm is presented. So far, many implementations of AES have been proposed, for various goals that effect the Sub Byte transformation in various ways. These methods of implementation are based on combinational logic and are done in polynomial bases. In the proposed architecture, it is done by using composite field arithmetic in normal bases. In addition, efficient key expansion architecture suitable for 6 sub pipelined round units is also presented. These designs were described using VerilogHDL, simulated using Modelsim.
Computational Fluency, Algorithms, and Mathematical Proficiency: One Mathematician's Perspective.
ERIC Educational Resources Information Center
Bass, Hyman
2003-01-01
Suggests that algorithms, both traditional and student-invented, are proper objects of study not only as tools for computation, but also for understanding the nature of the operations of arithmetic. (Author/NB)
Parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Amin-Javaheri, Masoud; Orin, David E.
1989-01-01
The development of an O(log2N) parallel algorithm for the manipulator inertia matrix is presented. It is based on the most efficient serial algorithm which uses the composite rigid body method. Recursive doubling is used to reformulate the linear recurrence equations which are required to compute the diagonal elements of the matrix. It results in O(log2N) levels of computation. Computation of the off-diagonal elements involves N linear recurrences of varying-size and a new method, which avoids redundant computation of position and orientation transforms for the manipulator, is developed. The O(log2N) algorithm is presented in both equation and graphic forms which clearly show the parallelism inherent in the algorithm.
Free energy computations employing Jarzynski identity and Wang - Landau algorithm
NASA Astrophysics Data System (ADS)
Kalyan, M. Suman; Murthy, K. P. N.; Sastry, V. S. S.
2016-05-01
We introduce a simple method to compute free energy differences employing Jarzynski identity in conjunction with Wang - Landau algorithm. We demonstrate this method on Ising spin system by comparing the results with those obtained from canonical sampling.
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention
A computational fluid dynamics algorithm on a massively parallel computer
NASA Technical Reports Server (NTRS)
Jespersen, Dennis C.; Levit, Creon
1989-01-01
The implementation and performance of a finite-difference algorithm for the compressible Navier-Stokes equations in two or three dimensions on the Connection Machine are described. This machine is a single-instruction multiple-data machine with up to 65536 physical processors. The implicit portion of the algorithm is of particular interest. Running times and megadrop rates are given for two- and three-dimensional problems. Included are comparisons with the standard codes on a Cray X-MP/48.
Algorithm for Computing Particle/Surface Interactions
NASA Technical Reports Server (NTRS)
Hughes, David W.
2009-01-01
An algorithm has been devised for predicting the behaviors of sparsely spatially distributed particles impinging on a solid surface in a rarefied atmosphere. Under the stated conditions, prior particle-transport models in which (1) dense distributions of particles are treated as continuum fluids; or (2) sparse distributions of particles are considered to be suspended in and to diffuse through fluid streams are not valid.
Algorithm implementation on the Navier-Stokes computer
NASA Technical Reports Server (NTRS)
Krist, Steven E.; Zang, Thomas A.
1987-01-01
The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Neural computing architectures: The design of brain-like machines
Aleksander, I.
1989-01-01
Theoretical and applications aspects of neural-network (NN) computers are discussed in chapters contributed by European experts. Topics addressed include speech recognition based on topology-preserving neural maps, neural-map applications, backpropagation in nonfeedforward NNs, a parallel-distributed-processing learning approach to natural language, the learning capabilities of Boolean NNs, the logic of connectionist systems, and a probabilistic-logic NN for associative learning. Consideration is given to N-tuple sampling and genetic algorithms for speech recognition; the dynamic behavior of Boolean NNs; statistical mechanics and NNs; digital NNs, matched filters, and optical implementations; heteroassociative NNs using cabling vs link-disabling local modification rules; and the generation of movement trajectories in primates and robots. Also provided is an overview of parallel distributed processing.
A class of least-squares filtering and identification algorithms with systolic array architectures
NASA Technical Reports Server (NTRS)
Kalson, Seth Z.; Yao, Kung
1991-01-01
A unified approach is presented for deriving a large class of new and previously known time- and order-recursive least-squares algorithms with systolic array architectures, suitable for high-throughput-rate and VLSI implementations of space-time filtering and system identification problems. The geometrical derivation given is unique in that no assumption is made concerning the rank of the sample data correlation matrix. This method utilizes and extends the concept of oblique projections, as used previously in the derivations of the least-squares lattice algorithms. Exponentially weighted least-squares criteria are considered for both sliding and growing memory.
NASA Technical Reports Server (NTRS)
Liu, Kuojuey Ray
1990-01-01
Least-squares (LS) estimations and spectral decomposition algorithms constitute the heart of modern signal processing and communication problems. Implementations of recursive LS and spectral decomposition algorithms onto parallel processing architectures such as systolic arrays with efficient fault-tolerant schemes are the major concerns of this dissertation. There are four major results in this dissertation. First, we propose the systolic block Householder transformation with application to the recursive least-squares minimization. It is successfully implemented on a systolic array with a two-level pipelined implementation at the vector level as well as at the word level. Second, a real-time algorithm-based concurrent error detection scheme based on the residual method is proposed for the QRD RLS systolic array. The fault diagnosis, order degraded reconfiguration, and performance analysis are also considered. Third, the dynamic range, stability, error detection capability under finite-precision implementation, order degraded performance, and residual estimation under faulty situations for the QRD RLS systolic array are studied in details. Finally, we propose the use of multi-phase systolic algorithms for spectral decomposition based on the QR algorithm. Two systolic architectures, one based on triangular array and another based on rectangular array, are presented for the multiphase operations with fault-tolerant considerations. Eigenvectors and singular vectors can be easily obtained by using the multi-pase operations. Performance issues are also considered.
NASA Astrophysics Data System (ADS)
Schämann, M.; Bücker, M.; Hessel, S.; Langmann, U.
2008-05-01
High data rates combined with high mobility represent a challenge for the design of cellular devices. Advanced algorithms are required which result in higher complexity, more chip area and increased power consumption. However, this contrasts to the limited power supply of mobile devices. This presentation discusses the application of an HSDPA receiver which has been optimized regarding power consumption with the focus on the algorithmic and architectural level. On algorithmic level the Rake combiner, Prefilter-Rake equalizer and MMSE equalizer are compared regarding their BER performance. Both equalizer approaches provide a significant increase of performance for high data rates compared to the Rake combiner which is commonly used for lower data rates. For both equalizer approaches several adaptive algorithms are available which differ in complexity and convergence properties. To identify the algorithm which achieves the required performance with the lowest power consumption the algorithms have been investigated using SystemC models regarding their performance and arithmetic complexity. Additionally, for the Prefilter Rake equalizer the power estimations of a modified Griffith (LMS) and a Levinson (RLS) algorithm have been compared with the tool ORINOCO supplied by ChipVision. The accuracy of this tool has been verified with a scalable architecture of the UMTS channel estimation described both in SystemC and VHDL targeting a 130 nm CMOS standard cell library. An architecture combining all three approaches combined with an adaptive control unit is presented. The control unit monitors the current condition of the propagation channel and adjusts parameters for the receiver like filter size and oversampling ratio to minimize the power consumption while maintaining the required performance. The optimization strategies result in a reduction of the number of arithmetic operations up to 70% for single components which leads to an estimated power reduction of up to 40
Visual pattern recognition network: its training algorithm and its optoelectronic architecture
NASA Astrophysics Data System (ADS)
Wang, Ning; Liu, Liren
1996-07-01
A visual pattern recognition network and its training algorithm are proposed. The network constructed of a one-layer morphology network and a two-layer modified Hamming net. This visual network can implement invariant pattern recognition with respect to image translation and size projection. After supervised learning takes place, the visual network extracts image features and classifies patterns much the same as living beings do. Moreover we set up its optoelectronic architecture for real-time pattern recognition.
Applying a cloud computing approach to storage architectures for spacecraft
NASA Astrophysics Data System (ADS)
Baldor, Sue A.; Quiroz, Carlos; Wood, Paul
As sensor technologies, processor speeds, and memory densities increase, spacecraft command, control, processing, and data storage systems have grown in complexity to take advantage of these improvements and expand the possible missions of spacecraft. Spacecraft systems engineers are increasingly looking for novel ways to address this growth in complexity and mitigate associated risks. Looking to conventional computing, many solutions have been executed to solve both the problem of complexity and heterogeneity in systems. In particular, the cloud-based paradigm provides a solution for distributing applications and storage capabilities across multiple platforms. In this paper, we propose utilizing a cloud-like architecture to provide a scalable mechanism for providing mass storage in spacecraft networks that can be reused on multiple spacecraft systems. By presenting a consistent interface to applications and devices that request data to be stored, complex systems designed by multiple organizations may be more readily integrated. Behind the abstraction, the cloud storage capability would manage wear-leveling, power consumption, and other attributes related to the physical memory devices, critical components in any mass storage solution for spacecraft. Our approach employs SpaceWire networks and SpaceWire-capable devices, although the concept could easily be extended to non-heterogeneous networks consisting of multiple spacecraft and potentially the ground segment.
Hybrid VLSI/QCA Architecture for Computing FFTs
NASA Technical Reports Server (NTRS)
Fijany, Amir; Toomarian, Nikzad; Modarres, Katayoon; Spotnitz, Matthew
2003-01-01
A data-processor architecture that would incorporate elements of both conventional very-large-scale integrated (VLSI) circuitry and quantum-dot cellular automata (QCA) has been proposed to enable the highly parallel and systolic computation of fast Fourier transforms (FFTs). The proposed circuit would complement the QCA-based circuits described in several prior NASA Tech Briefs articles, namely Implementing Permutation Matrices by Use of Quantum Dots (NPO-20801), Vol. 25, No. 10 (October 2001), page 42; Compact Interconnection Networks Based on Quantum Dots (NPO-20855) Vol. 27, No. 1 (January 2003), page 32; and Bit-Serial Adder Based on Quantum Dots (NPO-20869), Vol. 27, No. 1 (January 2003), page 35. The cited prior articles described the limitations of very-large-scale integrated (VLSI) circuitry and the major potential advantage afforded by QCA. To recapitulate: In a VLSI circuit, signal paths that are required not to interact with each other must not cross in the same plane. In contrast, for reasons too complex to describe in the limited space available for this article, suitably designed and operated QCAbased signal paths that are required not to interact with each other can nevertheless be allowed to cross each other in the same plane without adverse effect. In principle, this characteristic could be exploited to design compact, coplanar, simple (relative to VLSI) QCA-based networks to implement complex, advanced interconnection schemes.
Algorithms and software for solving finite element equations on serial and parallel architectures
NASA Technical Reports Server (NTRS)
George, Alan
1989-01-01
Over the past 15 years numerous new techniques have been developed for solving systems of equations and eigenvalue problems arising in finite element computations. A package called SPARSPAK has been developed by the author and his co-workers which exploits these new methods. The broad objective of this research project is to incorporate some of this software in the Computational Structural Mechanics (CSM) testbed, and to extend the techniques for use on multiprocessor architectures.
Decomposition algorithms for stochastic programming on a computational grid.
Linderoth, J.; Wright, S.; Mathematics and Computer Science; Axioma Inc.
2003-01-01
We describe algorithms for two-stage stochastic linear programming with recourse and their implementation on a grid computing platform. In particular, we examine serial and asynchronous versions of the L-shaped method and a trust-region method. The parallel platform of choice is the dynamic, heterogeneous, opportunistic platform provided by the Condor system. The algorithms are of master-worker type (with the workers being used to solve second-stage problems), and the MW runtime support library (which supports master-worker computations) is key to the implementation. Computational results are presented on large sample-average approximations of problems from the literature.
On the performances of computer vision algorithms on mobile platforms
NASA Astrophysics Data System (ADS)
Battiato, S.; Farinella, G. M.; Messina, E.; Puglisi, G.; Ravì, D.; Capra, A.; Tomaselli, V.
2012-01-01
Computer Vision enables mobile devices to extract the meaning of the observed scene from the information acquired with the onboard sensor cameras. Nowadays, there is a growing interest in Computer Vision algorithms able to work on mobile platform (e.g., phone camera, point-and-shot-camera, etc.). Indeed, bringing Computer Vision capabilities on mobile devices open new opportunities in different application contexts. The implementation of vision algorithms on mobile devices is still a challenging task since these devices have poor image sensors and optics as well as limited processing power. In this paper we have considered different algorithms covering classic Computer Vision tasks: keypoint extraction, face detection, image segmentation. Several tests have been done to compare the performances of the involved mobile platforms: Nokia N900, LG Optimus One, Samsung Galaxy SII.
The dynamic Allan variance II: a fast computational algorithm.
Galleani, Lorenzo
2010-01-01
The stability of an atomic clock can change with time due to several factors, such as temperature, humidity, radiations, aging, and sudden breakdowns. The dynamic Allan variance, or DAVAR, is a representation of the time-varying stability of an atomic clock, and it can be used to monitor the clock behavior. Unfortunately, the computational time of the DAVAR grows very quickly with the length of the analyzed time series. In this article, we present a fast algorithm for the computation of the DAVAR, and we also extend it to the case of missing data. Numerical simulations show that the fast algorithm dramatically reduces the computational time. The fast algorithm is useful when the analyzed time series is long, or when many clocks must be monitored, or when the computational power is low, as happens onboard satellites and space probes.
Durand, Melissa A; Wang, Steven; Hooley, Regina J; Raghu, Madhavi; Philpotts, Liane E
2016-01-01
As use of digital breast tomosynthesis becomes increasingly widespread, new management challenges are inevitable because tomosynthesis may reveal suspicious lesions not visible at conventional two-dimensional (2D) full-field digital mammography. Architectural distortion is a mammographic finding associated with a high positive predictive value for malignancy. It is detected more frequently at tomosynthesis than at 2D digital mammography and may even be occult at conventional 2D imaging. Few studies have focused on tomosynthesis-detected architectural distortions to date, and optimal management of these distortions has yet to be well defined. Since implementing tomosynthesis at our institution in 2011, we have learned some practical ways to assess architectural distortion. Because distortions may be subtle, tomosynthesis localization tools plus improved visualization of adjacent landmarks are crucial elements in guiding mammographic identification of elusive distortions. These same tools can guide more focused ultrasonography (US) of the breast, which facilitates detection and permits US-guided tissue sampling. Some distortions may be sonographically occult, in which case magnetic resonance imaging may be a reasonable option, both to increase diagnostic confidence and to provide a means for image-guided biopsy. As an alternative, tomosynthesis-guided biopsy, conventional stereotactic biopsy (when possible), or tomosynthesis-guided needle localization may be used to achieve tissue diagnosis. Practical uses for tomosynthesis in evaluation of architectural distortion are highlighted, potential complications are identified, and a working algorithm for management of tomosynthesis-detected architectural distortion is proposed. PMID:26963448
Qhull: Quickhull algorithm for computing the convex hull
NASA Astrophysics Data System (ADS)
Barber, C. Bradford; Dobkin, David P.; Huhdanpaa, Hannu
2013-04-01
Qhull computes the convex hull, Delaunay triangulation, Voronoi diagram, halfspace intersection about a point, furthest-site Delaunay triangulation, and furthest-site Voronoi diagram. The source code runs in 2-d, 3-d, 4-d, and higher dimensions. Qhull implements the Quickhull algorithm for computing the convex hull. It handles roundoff errors from floating point arithmetic. It computes volumes, surface areas, and approximations to the convex hull.
Computation of multiple Lie derivatives by algorithmic differentiation
NASA Astrophysics Data System (ADS)
Robenack, Klaus
2008-04-01
Lie derivatives are often used in nonlinear control and system theory. In general, these Lie derivatives are computed symbolically using computer algebra software. Although this approach is well-suited for small and medium-size problems, it is difficult to apply this technique to very complicated systems. We suggest an alternative method to compute the values of iterated and mixed Lie derivatives by algorithmic differentiation.
Computational Algorithms for Device-Circuit Coupling
KEITER, ERIC R.; HUTCHINSON, SCOTT A.; HOEKSTRA, ROBERT J.; RANKIN, ERIC LAMONT; RUSSO, THOMAS V.; WATERS, LON J.
2003-01-01
Circuit simulation tools (e.g., SPICE) have become invaluable in the development and design of electronic circuits. Similarly, device-scale simulation tools (e.g., DaVinci) are commonly used in the design of individual semiconductor components. Some problems, such as single-event upset (SEU), require the fidelity of a mesh-based device simulator but are only meaningful when dynamically coupled with an external circuit. For such problems a mixed-level simulator is desirable, but the two types of simulation generally have different (sometimes conflicting) numerical requirements. To address these considerations, we have investigated variations of the two-level Newton algorithm, which preserves tight coupling between the circuit and the partial differential equations (PDE) device, while optimizing the numerics for both.
A simple algorithm for computing canonical forms
NASA Technical Reports Server (NTRS)
Ford, H.; Hunt, L. R.; Renjeng, S.
1986-01-01
It is well known that all linear time-invariant controllable systems can be transformed to Brunovsky canonical form by a transformation consisting only of coordinate changes and linear feedback. However, the actual procedures for doing this have tended to be overly complex. The technique introduced here is envisioned as an on-line procedure and is inspired by George Meyer's tangent model for nonlinear systems. The process utilizes Meyer's block triangular form as an intermedicate step in going to Brunovsky form. The method also involves orthogonal matrices, thus eliminating the need for the computation of matrix inverses. In addition, the Kronecker indices can be computed as a by-product of this transformation so it is necessary to know them in advance.
Feature detection algorithms in computed images
NASA Astrophysics Data System (ADS)
Gurbuz, Ali Cafer
2008-10-01
The problem of sensing a medium by several sensors and retrieving interesting features is a very general one. The basic framework is generally the same for applications from MRI, tomography, Radar SAR imaging to subsurface imaging, even though the data acquisition processes, sensing geometries and sensed properties are different. In this thesis we introduced a new perspective to the problem of remote sensing and information retrieval by studying the problem of subsurface imaging using GPR and seismic sensors. We have shown that if the sensed medium is sparse in some domain then it can be imaged using many fewer measurements than required by the standard methods. This leads to much lower data acquisition times and better images. We have used the ideas from Compressive Sensing, which show that a small number of random measurements about a signal are sufficient to completely characterize it, if the signal is sparse or compressible in some domain. Although we have applied our ideas to the subsurface imaging problem, our results are general and can be extended to other remote sensing applications. A second objective in remote sensing is information retrieval which involves searching for important features in the computed image. In this thesis we focus on detecting buried structures like pipes, and tunnels in computed GPR or seismic images. The problem of finding these structures in high clutter and noise conditions, and finding them faster than the standard shape detecting methods is analyzed. One of the most important contributions of this thesis is where the sensing and the information retrieval stages are unified in a single framework using compressive sensing. Instead of taking lots of standard measurements to compute the image of the medium and search the necessary information in the computed image, only a small number of measurements as random projections are used to infer the information without generating the image of the medium.
Parallel algorithm for computing 3-D reachable workspaces
NASA Astrophysics Data System (ADS)
Alameldin, Tarek K.; Sobh, Tarek M.
1992-03-01
The problem of computing the 3-D workspace for redundant articulated chains has applications in a variety of fields such as robotics, computer aided design, and computer graphics. The computational complexity of the workspace problem is at least NP-hard. The recent advent of parallel computers has made practical solutions for the workspace problem possible. Parallel algorithms for computing the 3-D workspace for redundant articulated chains with joint limits are presented. The first phase of these algorithms computes workspace points in parallel. The second phase uses workspace points that are computed in the first phase and fits a 3-D surface around the volume that encompasses the workspace points. The second phase also maps the 3- D points into slices, uses region filling to detect the holes and voids in the workspace, extracts the workspace boundary points by testing the neighboring cells, and tiles the consecutive contours with triangles. The proposed algorithms are efficient for computing the 3-D reachable workspace for articulated linkages, not only those with redundant degrees of freedom but also those with joint limits.
Approximate Algorithms for Computing Spatial Distance Histograms with Accuracy Guarantees
Grupcev, Vladimir; Yuan, Yongke; Tu, Yi-Cheng; Huang, Jin; Chen, Shaoping; Pandit, Sagar; Weng, Michael
2014-01-01
Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations impose great challenges to database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and requires quadratic time to compute using a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs. While beating the naive solution, such algorithms are still not practical in processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm, and this solution has running time that is unrelated to the system size N. We also develop a mathematical model to analyze the mechanism that leads to errors in the basic approximate algorithm. Our model provides insights on how the algorithm can be improved to achieve higher accuracy and efficiency. Such insights give rise to a new approximate algorithm with improved time/accuracy tradeoff. Experimental results confirm our analysis. PMID:24693210
Approximate Algorithms for Computing Spatial Distance Histograms with Accuracy Guarantees.
Grupcev, Vladimir; Yuan, Yongke; Tu, Yi-Cheng; Huang, Jin; Chen, Shaoping; Pandit, Sagar; Weng, Michael
2012-09-01
Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations impose great challenges to database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and requires quadratic time to compute using a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs. While beating the naive solution, such algorithms are still not practical in processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm, and this solution has running time that is unrelated to the system size N. We also develop a mathematical model to analyze the mechanism that leads to errors in the basic approximate algorithm. Our model provides insights on how the algorithm can be improved to achieve higher accuracy and efficiency. Such insights give rise to a new approximate algorithm with improved time/accuracy tradeoff. Experimental results confirm our analysis.
Data Compression Algorithm Architecture for Large Depth-of-Field Particle Image Velocimeters
NASA Technical Reports Server (NTRS)
Bos, Brent; Memarsadeghi, Nargess; Kizhner, Semion; Antonille, Scott
2013-01-01
A large depth-of-field particle image velocimeter (PIV) is designed to characterize dynamic dust environments on planetary surfaces. This instrument detects lofted dust particles, and senses the number of particles per unit volume, measuring their sizes, velocities (both speed and direction), and shape factors when the particles are large. To measure these particle characteristics in-flight, the instrument gathers two-dimensional image data at a high frame rate, typically >4,000 Hz, generating large amounts of data for every second of operation, approximately 6 GB/s. To characterize a planetary dust environment that is dynamic, the instrument would have to operate for at least several minutes during an observation period, easily producing more than a terabyte of data per observation. Given current technology, this amount of data would be very difficult to store onboard a spacecraft, and downlink to Earth. Since 2007, innovators have been developing an autonomous image analysis algorithm architecture for the PIV instrument to greatly reduce the amount of data that it has to store and downlink. The algorithm analyzes PIV images and automatically reduces the image information down to only the particle measurement data that is of interest, reducing the amount of data that is handled by more than 10(exp 3). The state of development for this innovation is now fairly mature, with a functional algorithm architecture, along with several key pieces of algorithm logic, that has been proven through field test data acquired with a proof-of-concept PIV instrument.
A computer algorithm for automatic beam steering
Drennan, E.
1992-06-01
Beam steering is done by modifying the current in a trim or bending magnet. If the current change is the right amount the beam can be made to bend in such a manner that it will hit a swic or BPM downstream from the magnet at a predetermined set point. Although both bending magnets and trim magnets can be used to modify beam angle, beam steering is usually done with trim magnets. This is so because, during beam steering the beam angle is usually modified only by a small amount which can be easily achieved with a trim magnet. Thus in this note, all steering magnets will be assumed to be trim magnets. There are two ways of monitoring beam position. One way is done using a BPM and the other is done using a swic. For simplicity, beam position monitoring in this paper will be referred to being done with a swic. Beam steering can be done manually by changing the current through a trim magnet and monitoring the position of the beam downstream from the magnet with a swic. Alternatively the beam can be positioned automatically using a computer which periodically updates the current through a specific number of trim magnets. The purpose of this note is to describe the steps involved in coming up with such a computer program. There are two main aspects to automatic beam steering. First a relationship between the beam position and the bending magnet is needed. Secondly a beamline setup of swics and trim magnets has to be chosen that will position the beam according to the desired specifications. A simple example will be looked at that will show that once a mathematical relationship between the needed change of the beam position on a swic and the change in trim currents is established, a computer could be programmed to calculate and update the trim currents.
Gradient Learning Algorithms for Ontology Computing
Gao, Wei; Zhu, Linli
2014-01-01
The gradient learning model has been raising great attention in view of its promising perspectives for applications in statistics, data dimensionality reducing, and other specific fields. In this paper, we raise a new gradient learning model for ontology similarity measuring and ontology mapping in multidividing setting. The sample error in this setting is given by virtue of the hypothesis space and the trick of ontology dividing operator. Finally, two experiments presented on plant and humanoid robotics field verify the efficiency of the new computation model for ontology similarity measure and ontology mapping applications in multidividing setting. PMID:25530752
Application of a new finite difference algorithm for computational aeroacoustics
NASA Technical Reports Server (NTRS)
Goodrich, John W.
1995-01-01
Acoustic problems have become extremely important in recent years because of research efforts such as the High Speed Civil Transport program. Computational aeroacoustics (CAA) requires a faithful representation of wave propagation over long distances, and needs algorithms that are accurate and boundary conditions that are unobtrusive. This paper applies a new finite difference method and boundary algorithm to the Linearized Euler Equations (LEE). The results demonstrate the ability of a new fourth order propagation algorithm to accurately simulate the genuinely multidimensional wave dynamics of acoustic propagation in two space dimensions with the LEE. The results also show the ability of a new outflow boundary condition and fourth order algorithm to pass the evolving solution from the computational domain with no perceptible degradation of the solution remaining within the domain.
General purpose architecture for intelligent computer-aided training
NASA Technical Reports Server (NTRS)
Loftin, R. Bowen (Inventor); Wang, Lui (Inventor); Baffes, Paul T. (Inventor); Hua, Grace C. (Inventor)
1994-01-01
An intelligent computer-aided training system having a general modular architecture is provided for use in a wide variety of training tasks and environments. It is comprised of a user interface which permits the trainee to access the same information available in the task environment and serves as a means for the trainee to assert actions to the system; a domain expert which is sufficiently intelligent to use the same information available to the trainee and carry out the task assigned to the trainee; a training session manager for examining the assertions made by the domain expert and by the trainee for evaluating such trainee assertions and providing guidance to the trainee which are appropriate to his acquired skill level; a trainee model which contains a history of the trainee interactions with the system together with summary evaluative data; an intelligent training scenario generator for designing increasingly complex training exercises based on the current skill level contained in the trainee model and on any weaknesses or deficiencies that the trainee has exhibited in previous interactions; and a blackboard that provides a common fact base for communication between the other components of the system. Preferably, the domain expert contains a list of 'mal-rules' which typifies errors that are usually made by novice trainees. Also preferably, the training session manager comprises an intelligent error detection means and an intelligent error handling means. The present invention utilizes a rule-based language having a control structure whereby a specific message passing protocol is utilized with respect to tasks which are procedural or step-by-step in structure. The rules can be activated by the trainee in any order to reach the solution by any valid or correct path.
NASA Technical Reports Server (NTRS)
1985-01-01
Slides are reproduced that describe the importance of having high performance number crunching and graphics capability. They also indicate the types of research and development underway at Ames Research Center to ensure that, in the near term, Ames is a smart buyer and user, and in the long-term that Ames knows the best possible solutions for number crunching and graphics needs. The drivers for this research are real computational physics applications of interest to Ames and NASA. They are concerned with how to map the applications, and how to maximize the physics learned from the results of the calculations. The computer graphics activities are aimed at getting maximum information from the three-dimensional calculations by using the real time manipulation of three-dimensional data on the Silicon Graphics workstation. Work is underway on new algorithms that will permit the display of experimental results that are sparse and random, the same way that the dense and regular computed results are displayed.
An algorithm for computationally expensive engineering optimization problems
NASA Astrophysics Data System (ADS)
Yoel, Tenne
2013-07-01
Modern engineering design often relies on computer simulations to evaluate candidate designs, a scenario which results in an optimization of a computationally expensive black-box function. In these settings, there will often exist candidate designs which cause the simulation to fail, and can therefore degrade the search effectiveness. To address this issue, this paper proposes a new metamodel-assisted computational intelligence optimization algorithm which incorporates classifiers into the optimization search. The classifiers predict which candidate designs are expected to cause the simulation to fail, and this prediction is used to bias the search towards designs predicted to be valid. To enhance the search effectiveness, the proposed algorithm uses an ensemble approach which concurrently employs several metamodels and classifiers. A rigorous performance analysis based on a set of simulation-driven design optimization problems shows the effectiveness of the proposed algorithm.
Parallel grid generation algorithm for distributed memory computers
NASA Technical Reports Server (NTRS)
Moitra, Stuti; Moitra, Anutosh
1994-01-01
A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.
An Agent Inspired Reconfigurable Computing Implementation of a Genetic Algorithm
NASA Technical Reports Server (NTRS)
Weir, John M.; Wells, B. Earl
2003-01-01
Many software systems have been successfully implemented using an agent paradigm which employs a number of independent entities that communicate with one another to achieve a common goal. The distributed nature of such a paradigm makes it an excellent candidate for use in high speed reconfigurable computing hardware environments such as those present in modem FPGA's. In this paper, a distributed genetic algorithm that can be applied to the agent based reconfigurable hardware model is introduced. The effectiveness of this new algorithm is evaluated by comparing the quality of the solutions found by the new algorithm with those found by traditional genetic algorithms. The performance of a reconfigurable hardware implementation of the new algorithm on an FPGA is compared to traditional single processor implementations.
JPL control/structure interaction test bed real-time control computer architecture
NASA Technical Reports Server (NTRS)
Briggs, Hugh C.
1989-01-01
The Control/Structure Interaction Program is a technology development program for spacecraft that exhibit interactions between the control system and structural dynamics. The program objectives include development and verification of new design concepts - such as active structure - and new tools - such as combined structure and control optimization algorithm - and their verification in ground and possibly flight test. A focus mission spacecraft was designed based upon a space interferometer and is the basis for design of the ground test article. The ground test bed objectives include verification of the spacecraft design concepts, the active structure elements and certain design tools such as the new combined structures and controls optimization tool. In anticipation of CSI technology flight experiments, the test bed control electronics must emulate the computation capacity and control architectures of space qualifiable systems as well as the command and control networks that will be used to connect investigators with the flight experiment hardware. The Test Bed facility electronics were functionally partitioned into three units: a laboratory data acquisition system for structural parameter identification and performance verification; an experiment supervisory computer to oversee the experiment, monitor the environmental parameters and perform data logging; and a multilevel real-time control computing system. The design of the Test Bed electronics is presented along with hardware and software component descriptions. The system should break new ground in experimental control electronics and is of interest to anyone working in the verification of control concepts for large structures.
A computational study of routing algorithms for realistic transportation networks
Jacob, R.; Marathe, M.V.; Nagel, K.
1998-12-01
The authors carry out an experimental analysis of a number of shortest path (routing) algorithms investigated in the context of the TRANSIMS (Transportation Analysis and Simulation System) project. The main focus of the paper is to study how various heuristic and exact solutions, associated data structures affected the computational performance of the software developed especially for realistic transportation networks. For this purpose the authors have used Dallas Fort-Worth road network with very high degree of resolution. The following general results are obtained: (1) they discuss and experimentally analyze various one-one shortest path algorithms, which include classical exact algorithms studied in the literature as well as heuristic solutions that are designed to take into account the geometric structure of the input instances; (2) they describe a number of extensions to the basic shortest path algorithm. These extensions were primarily motivated by practical problems arising in TRANSIMS and ITS (Intelligent Transportation Systems) related technologies. Extensions discussed include--(i) time dependent networks, (ii) multi-modal networks, (iii) networks with public transportation and associated schedules. Computational results are provided to empirically compare the efficiency of various algorithms. The studies indicate that a modified Dijkstra`s algorithm is computationally fast and an excellent candidate for use in various transportation planning applications as well as ITS related technologies.
Multipole Algorithms for Molecular Dynamics Simulation on High Performance Computers.
NASA Astrophysics Data System (ADS)
Elliott, William Dewey
1995-01-01
A fundamental problem in modeling large molecular systems with molecular dynamics (MD) simulations is the underlying N-body problem of computing the interactions between all pairs of N atoms. The simplest algorithm to compute pair-wise atomic interactions scales in runtime {cal O}(N^2), making it impractical for interesting biomolecular systems, which can contain millions of atoms. Recently, several algorithms have become available that solve the N-body problem by computing the effects of all pair-wise interactions while scaling in runtime less than {cal O}(N^2). One algorithm, which scales {cal O}(N) for a uniform distribution of particles, is called the Greengard-Rokhlin Fast Multipole Algorithm (FMA). This work describes an FMA-like algorithm called the Molecular Dynamics Multipole Algorithm (MDMA). The algorithm contains several features that are new to N-body algorithms. MDMA uses new, efficient series expansion equations to compute general 1/r^{n } potentials to arbitrary accuracy. In particular, the 1/r Coulomb potential and the 1/r^6 portion of the Lennard-Jones potential are implemented. The new equations are based on multivariate Taylor series expansions. In addition, MDMA uses a cell-to-cell interaction region of cells that is closely tied to worst case error bounds. The worst case error bounds for MDMA are derived in this work also. These bounds apply to other multipole algorithms as well. Several implementation enhancements are described which apply to MDMA as well as other N-body algorithms such as FMA and tree codes. The mathematics of the cell -to-cell interactions are converted to the Fourier domain for reduced operation count and faster computation. A relative indexing scheme was devised to locate cells in the interaction region which allows efficient pre-computation of redundant information and prestorage of much of the cell-to-cell interaction. Also, MDMA was integrated into the MD program SIgMA to demonstrate the performance of the program over
Computational enhancement of an unsymmetric block Lanczos algorithm
NASA Technical Reports Server (NTRS)
Kim, Hyoung M.; Craig, Roy R., Jr.
1990-01-01
An unsymmetric block Lanczos algorithm has been employed for the dynamic analysis of a large system which has arbitrary damping and/or repeated (or closely spaced) eigenvalues. In the algorithm development, the right and left Lanczos vectors are all theoretically biorthogonal to each other. However, these vectors may lose the biorthogonality owing to cancellation and roundoff errors. For the unsymmetric case there can be a breakdown, even without numerical errors. This paper describes computational techniques which have led to a robust unsymmetric block Lanczos algorithm.
A computationally efficient QRS detection algorithm for wearable ECG sensors.
Wang, Y; Deepu, C J; Lian, Y
2011-01-01
In this paper we present a novel Dual-Slope QRS detection algorithm with low computational complexity, suitable for wearable ECG devices. The Dual-Slope algorithm calculates the slopes on both sides of a peak in the ECG signal; And based on these slopes, three criterions are developed for simultaneously checking 1)Steepness 2)Shape and 3)Height of the signal, to locate the QRS complex. The algorithm, evaluated against MIT/BIH Arrhythmia Database, achieves a very high detection rate of 99.45%, a sensitivity of 99.82% and a positive prediction of 99.63%. PMID:22255619
Adaptive kinetic-fluid solvers for heterogeneous computing architectures
NASA Astrophysics Data System (ADS)
Zabelok, Sergey; Arslanbekov, Robert; Kolobov, Vladimir
2015-12-01
We show feasibility and benefits of porting an adaptive multi-scale kinetic-fluid code to CPU-GPU systems. Challenges are due to the irregular data access for adaptive Cartesian mesh, vast difference of computational cost between kinetic and fluid cells, and desire to evenly load all CPUs and GPUs during grid adaptation and algorithm refinement. Our Unified Flow Solver (UFS) combines Adaptive Mesh Refinement (AMR) with automatic cell-by-cell selection of kinetic or fluid solvers based on continuum breakdown criteria. Using GPUs enables hybrid simulations of mixed rarefied-continuum flows with a million of Boltzmann cells each having a 24 × 24 × 24 velocity mesh. We describe the implementation of CUDA kernels for three modules in UFS: the direct Boltzmann solver using the discrete velocity method (DVM), the Direct Simulation Monte Carlo (DSMC) solver, and a mesoscopic solver based on the Lattice Boltzmann Method (LBM), all using adaptive Cartesian mesh. Double digit speedups on single GPU and good scaling for multi-GPUs have been demonstrated.
Parallel matrix transpose algorithms on distributed memory concurrent computers
Choi, J.; Walker, D.W.; Dongarra, J.J. |
1993-10-01
This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A{center_dot}B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A{sup T}{center_dot}B{sup T}, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming
NASA Technical Reports Server (NTRS)
Dorband, John E.; Aburdene, Maurice F.
2002-01-01
Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.
NASA Technical Reports Server (NTRS)
Eberhardt, D. S.; Baganoff, D.; Stevens, K.
1984-01-01
Implicit approximate-factored algorithms have certain properties that are suitable for parallel processing. A particular computational fluid dynamics (CFD) code, using this algorithm, is mapped onto a multiple-instruction/multiple-data-stream (MIMD) computer architecture. An explanation of this mapping procedure is presented, as well as some of the difficulties encountered when trying to run the code concurrently. Timing results are given for runs on the Ames Research Center's MIMD test facility which consists of two VAX 11/780's with a common MA780 multi-ported memory. Speedups exceeding 1.9 for characteristic CFD runs were indicated by the timing results.
Radiation transport algorithms on trans-petaflops supercomputers of different architectures.
Christopher, Thomas Woods
2003-08-01
We seek to understand which supercomputer architecture will be best for supercomputers at the Petaflops scale and beyond. The process we use is to predict the cost and performance of several leading architectures at various years in the future. The basis for predicting the future is an expanded version of Moore's Law called the International Technology Roadmap for Semiconductors (ITRS). We abstract leading supercomputer architectures into chips connected by wires, where the chips and wires have electrical parameters predicted by the ITRS. We then compute the cost of a supercomputer system and the run time on a key problem of interest to the DOE (radiation transport). These calculations are parameterized by the time into the future and the technology expected to be available at that point. We find the new advanced architectures have substantial performance advantages but conventional designs are likely to be less expensive (due to economies of scale). We do not find a universal ''winner'', but instead the right architectural choice is likely to involve non-technical factors such as the availability of capital and how long people are willing to wait for results.
Algorithmic techniques for computer vision on a fine-grained parallel machine
Little, J.J.; Blelloch, G.E.; Cass, T.A.
1989-03-01
The authors describe algorithms for several problems from computer vision, and illustrate how they are implemented using a set of primitive parallel operations. The primitives the authors use include general permutations, grid permutations, and the scan operation - a restricted form of the prefix computation. They cover well-known problems allowing us to concentrate on the implementations rather than the problems. First, they describe some simple routines such as border following, computing histograms and filtering. They then discuss several modules built on these routines including edge detection, Hough transforms, and connected component labeling. Finally, they describe how these modules are composed into higher level vision modules. By defining the routines using a set of primitives operations, they abstract away from a particular architecture. In particular, one does not have to worry about features of machines such as the number of processors or whether a tightly connected architecture has a hypercube network or a four-dimensional grid network. One still needs to worry about the relative performance of the primitives on particular machines. The authors discuss the tradeoffs among primitives and try to identify which primitives are most important for particular problems. All the primitives discussed are supported by the Connection Machine (CM), and they outline how they are implemented. They have implemented most of the algorithms described on the Connection Machine.
NASA Astrophysics Data System (ADS)
Chao, Hu; Ray, Sylvian R.; Zheng, Nanning
1995-06-01
A new DSP-based neural simulating computer architecture and its ANN-based assignment method for parallel distributed processing are proposed. The hardware of the proposed neural simulating computer can be reconfigured in terms of a variety of research interests and requirements of pattern recognition. The software programming environment utilizes an intelligent compiler to perform static task assignment in both the cases of single-task muliprocessor and multitask processor. An improved Hopfield neural network which can converge to global optical solution is employed by the compiler to map different tasks or neurons to their corresponding real processors. An approach of introducing hidden layer to increase the computation ability of the neural simulating computer is also developed. Finally, a proof is given which shows that the use of improved Hopfield algorithm and the modification to network structure doesn't change the intrinsic properties of the original network.
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.
Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein
2015-12-01
Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include Smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraint and failure sensitivity. This is due to the fact that without safety considerations, small-integrated hardware will endanger patients' lives. Therefore, proposing some principals is required to construct wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.
Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein
2015-12-01
Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include Smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraint and failure sensitivity. This is due to the fact that without safety considerations, small-integrated hardware will endanger patients' lives. Therefore, proposing some principals is required to construct wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed. PMID:26364202
Proceedings of the 12th annual international symposium on computer architecture
Not Available
1985-01-01
This book presents the papers given at a conference on supercomputers and multiprocessors. Topics considered at the conference included array processing, pipelined central processing units, the classification of parallel processor architectures, caches and registers, LISP machines, special purpose parallel processors, uniprocessors, logic programming machines, commercial multiprocessors, fifth generation computer systems projects, data retrieval architectures, systolic array, data flow and reduction, and interconnection networks.
Beard, R.A.
1990-03-01
The purpose of this thesis is to explore the methods used to parallelize NP-complete problems and the degree of improvement that can be realized using a distributed parallel processor to solve these combinatoric problems. Common NP-complete problem characteristics such as a priori reductions, use of partial-state information, and inhomogeneous searches are identified and studied. The set covering problem (SCP) is implemented for this research because many applications such as information retrieval, task scheduling, and VLSI expression simplification can be structured as an SCP problem. In addition, its generic NP-complete common characteristics are well documented and a parallel implementation has not been reported. Parallel programming design techniques involve decomposing the problem and developing the parallel algorithms. The major components of a parallel solution are developed in a four phase process. First, a meta-level design is accomplished using an appropriate design language such as UNITY. Then, the UNITY design is transformed into an algorithm and implementation specific to a distributed architecture. Finally, a complexity analysis of the algorithm is performed. the a priori reductions are divided-and-conquer algorithms; whereas, the search for the optimal set cover is accomplished with a branch-and-bound algorithm. The search utilizes a global best cost maintained at a central location for distribution to all processors. Three methods of load balancing are implemented and studied: coarse grain with static allocation of the search space, fine grain with dynamic allocation, and dynamic load balancing.
Computational Fluid Dynamics. [numerical methods and algorithm development
NASA Technical Reports Server (NTRS)
1992-01-01
This collection of papers was presented at the Computational Fluid Dynamics (CFD) Conference held at Ames Research Center in California on March 12 through 14, 1991. It is an overview of CFD activities at NASA Lewis Research Center. The main thrust of computational work at Lewis is aimed at propulsion systems. Specific issues related to propulsion CFD and associated modeling will also be presented. Examples of results obtained with the most recent algorithm development will also be presented.
Adaptive-mesh algorithms for computational fluid dynamics
NASA Technical Reports Server (NTRS)
Powell, Kenneth G.; Roe, Philip L.; Quirk, James
1993-01-01
The basic goal of adaptive-mesh algorithms is to distribute computational resources wisely by increasing the resolution of 'important' regions of the flow and decreasing the resolution of regions that are less important. While this goal is one that is worthwhile, implementing schemes that have this degree of sophistication remains more of an art than a science. In this paper, the basic pieces of adaptive-mesh algorithms are described and some of the possible ways to implement them are discussed and compared. These basic pieces are the data structure to be used, the generation of an initial mesh, the criterion to be used to adapt the mesh to the solution, and the flow-solver algorithm on the resulting mesh. Each of these is discussed, with particular emphasis on methods suitable for the computation of compressible flows.
NML computation algorithms for tree-structured multinomial Bayesian networks.
Kontkanen, Petri; Wettig, Hannes; Myllymäki, Petri
2007-01-01
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains, it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size, since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending these algorithms to more complex, tree-structured Bayesian networks. PMID:18382603
Computations and algorithms in physical and biological problems
NASA Astrophysics Data System (ADS)
Qin, Yu
This dissertation presents the applications of state-of-the-art computation techniques and data analysis algorithms in three physical and biological problems: assembling DNA pieces, optimizing self-assembly yield, and identifying correlations from large multivariate datasets. In the first topic, in-depth analysis of using Sequencing by Hybridization (SBH) to reconstruct target DNA sequences shows that a modified reconstruction algorithm can overcome the theoretical boundary without the need for different types of biochemical assays and is robust to error. In the second topic, consistent with theoretical predictions, simulations using Graphics Processing Unit (GPU) demonstrate how controlling the short-ranged interactions between particles and controlling the concentrations optimize the self-assembly yield of a desired structure, and nonequilibrium behavior when optimizing concentrations is also unveiled by leveraging the computation capacity of GPUs. In the last topic, a methodology to incorporate existing categorization information into the search process to efficiently reconstruct the optimal true correlation matrix for multivariate datasets is introduced. Simulations on both synthetic and real financial datasets show that the algorithm is able to detect signals below the Random Matrix Theory (RMT) threshold. These three problems are representatives of using massive computation techniques and data analysis algorithms to tackle optimization problems, and outperform theoretical boundary when incorporating prior information into the computation.
Plagiarism Detection Algorithm for Source Code in Computer Science Education
ERIC Educational Resources Information Center
Liu, Xin; Xu, Chan; Ouyang, Boyu
2015-01-01
Nowadays, computer programming is getting more necessary in the course of program design in college education. However, the trick of plagiarizing plus a little modification exists among some students' home works. It's not easy for teachers to judge if there's plagiarizing in source code or not. Traditional detection algorithms cannot fit this…
Algorithms, Computation and Mathematics (Fortran Supplement). Teacher's Commentary. Revised Edition.
ERIC Educational Resources Information Center
Charp, Sylvia; And Others
This is the teacher's guide and commentary for the SMSG textbook Algorithms, Computation, and Mathematics (Fortran Supplement). The teacher's commentary provides background information for the teacher, suggestions for activities found in the Fortran Supplement, and answers for exercises and activities. The course is designed for high school…
An Architecture for the Emotion-Based Ubiquitous Services in Wearable Computing Environment
NASA Astrophysics Data System (ADS)
Lee, Haesung; Kwon, Joonhee
The exhibition of informatics services does not always attend to the context that involves the user's physiologically emotional signals. Wearable computing services with user's emotional information could be extremely effective in various applications. In this paper, with the aim of developing more effective human-centric services, we propose a novel methodological architecture. Proposed architecture employs the convergence of wearable computing technique, human's emotional information and context tagging technique to develop more realistic and robust human-centric service.
Survivable algorithms and redundancy management in NASA's distributed computing systems
NASA Technical Reports Server (NTRS)
Malek, Miroslaw
1992-01-01
The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.
Optimization of computer-generated binary holograms using genetic algorithms
NASA Astrophysics Data System (ADS)
Cojoc, Dan; Alexandrescu, Adrian
1999-11-01
The aim of this paper is to compare genetic algorithms against direct point oriented coding in the design of binary phase Fourier holograms, computer generated. These are used as fan-out elements for free space optical interconnection. Genetic algorithms are optimization methods which model the natural process of genetic evolution. The configuration of the hologram is encoded to form a chromosome. To start the optimization, a population of different chromosomes randomly generated is considered. The chromosomes compete, mate and mutate until the best chromosome is obtained according to a cost function. After explaining the operators that are used by genetic algorithms, this paper presents two examples with 32 X 32 genes in a chromosome. The crossover type and the number of mutations are shown to be important factors which influence the convergence of the algorithm. GA is demonstrated to be a useful tool to design namely binary phase holograms of complicate structures.
Comparison and improvement of algorithms for computing minimal cut sets
2013-01-01
Background Constrained minimal cut sets (cMCSs) have recently been introduced as a framework to enumerate minimal genetic intervention strategies for targeted optimization of metabolic networks. Two different algorithmic schemes (adapted Berge algorithm and binary integer programming) have been proposed to compute cMCSs from elementary modes. However, in their original formulation both algorithms are not fully comparable. Results Here we show that by a small extension to the integer program both methods become equivalent. Furthermore, based on well-known preprocessing procedures for integer programming we present efficient preprocessing steps which can be used for both algorithms. We then benchmark the numerical performance of the algorithms in several realistic medium-scale metabolic models. The benchmark calculations reveal (i) that these preprocessing steps can lead to an enormous speed-up under both algorithms, and (ii) that the adapted Berge algorithm outperforms the binary integer approach. Conclusions Generally, both of our new implementations are by at least one order of magnitude faster than other currently available implementations. PMID:24191903
Optical pattern recognition architecture implementing the mean-square error correlation algorithm
Molley, Perry A.
1991-01-01
An optical architecture implementing the mean-square error correlation algorithm, MSE=.SIGMA.[I-R].sup.2 for discriminating the presence of a reference image R in an input image scene I by computing the mean-square-error between a time-varying reference image signal s.sub.1 (t) and a time-varying input image signal s.sub.2 (t) includes a laser diode light source which is temporally modulated by a double-sideband suppressed-carrier source modulation signal I.sub.1 (t) having the form I.sub.1 (t)=A.sub.1 [1+.sqroot.2m.sub.1 s.sub.1 (t)cos (2.pi.f.sub.o t)] and the modulated light output from the laser diode source is diffracted by an acousto-optic deflector. The resultant intensity of the +1 diffracted order from the acousto-optic device is given by: I.sub.2 (t)=A.sub.2 [+2m.sub.2.sup.2 s.sub.2.sup.2 (t)-2.sqroot.2m.sub.2 (t) cos (2.pi.f.sub.o t] The time integration of the two signals I.sub.1 (t) and I.sub.2 (t) on the CCD deflector plane produces the result R(.tau.) of the mean-square error having the form: R(.tau.)=A.sub.1 A.sub.2 {[T]+[2m.sub.2.sup.2.multidot..intg.s.sub.2.sup.2 (t-.tau.)dt]-[2m.sub.1 m.sub.2 cos (2.tau.f.sub.o .tau.).multidot..intg.s.sub.1 (t)s.sub.2 (t-.tau.)dt]} where: s.sub.1 (t) is the signal input to the diode modulation source: s.sub.2 (t) is the signal input to the AOD modulation source; A.sub.1 is the light intensity; A.sub.2 is the diffraction efficiency; m.sub.1 and m.sub.2 are constants that determine the signal-to-bias ratio; f.sub.o is the frequency offset between the oscillator at f.sub.c and the modulation at f.sub.c +f.sub.o ; and a.sub.o and a.sub.1 are constant chosen to bias the diode source and the acousto-optic deflector into their respective linear operating regions so that the diode source exhibits a linear intensity characteristic and the AOD exhibits a linear amplitude characteristic.
Computational algorithms to simulate the steel continuous casting
NASA Astrophysics Data System (ADS)
Ramírez-López, A.; Soto-Cortés, G.; Palomar-Pardavé, M.; Romero-Romo, M. A.; Aguilar-López, R.
2010-10-01
Computational simulation is a very powerful tool to analyze industrial processes to reduce operating risks and improve profits from equipment. The present work describes the development of some computational algorithms based on the numerical method to create a simulator for the continuous casting process, which is the most popular method to produce steel products for metallurgical industries. The kinematics of industrial processing was computationally reproduced using subroutines logically programmed. The cast steel by each strand was calculated using an iterative method nested in the main loop. The process was repeated at each time step (Δ t) to calculate the casting time, simultaneously, the steel billets produced were counted and stored. The subroutines were used for creating a computational representation of a continuous casting plant (CCP) and displaying the simulation of the steel displacement through the CCP. These algorithms have been developed to create a simulator using the programming language C++. Algorithms for computer animation of the continuous casting process were created using a graphical user interface (GUI). Finally, the simulator functionality was shown and validated by comparing with the industrial information of the steel production of three casters.
ERIC Educational Resources Information Center
Amenyo, John-Thones
2012-01-01
Carefully engineered playable games can serve as vehicles for students and practitioners to learn and explore the programming of advanced computer architectures to execute applications, such as high performance computing (HPC) and complex, inter-networked, distributed systems. The article presents families of playable games that are grounded in…
NASA Technical Reports Server (NTRS)
Matthews, Bryan L.; Srivastava, Ashok N.
2010-01-01
Prior to the launch of STS-119 NASA had completed a study of an issue in the flow control valve (FCV) in the Main Propulsion System of the Space Shuttle using an adaptive learning method known as Virtual Sensors. Virtual Sensors are a class of algorithms that estimate the value of a time series given other potentially nonlinearly correlated sensor readings. In the case presented here, the Virtual Sensors algorithm is based on an ensemble learning approach and takes sensor readings and control signals as input to estimate the pressure in a subsystem of the Main Propulsion System. Our results indicate that this method can detect faults in the FCV at the time when they occur. We use the standard deviation of the predictions of the ensemble as a measure of uncertainty in the estimate. This uncertainty estimate was crucial to understanding the nature and magnitude of transient characteristics during startup of the engine. This paper overviews the Virtual Sensors algorithm and discusses results on a comprehensive set of Shuttle missions and also discusses the architecture necessary for deploying such algorithms in a real-time, closed-loop system or a human-in-the-loop monitoring system. These results were presented at a Flight Readiness Review of the Space Shuttle in early 2009.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
An algorithm for computing the distance spectrum of trellis codes
NASA Technical Reports Server (NTRS)
Rouanne, Marc; Costello, Daniel J., Jr.
1989-01-01
A class of quasiregular codes is defined for which the distance spectrum can be calculated from the codeword corresponding to the all-zero information sequence. Convolutional codes and regular codes are both quasiregular, as well as most of the best known trellis codes. An algorithm to compute the distance spectrum of linear, regular, and quasiregular trellis codes is presented. In particular, it can calculate the weight spectrum of convolutional (linear trellis) codes and the distance spectrum of most of the best known trellis codes. The codes do not have to be linear or regular, and the signals do not have to be used with equal probabilities. The algorithm is derived from a bidirectional stack algorithm, although it could also be based on the Viterbi algorithm. The algorithm is used to calculate the beginning of the distance spectrum of some of the best known trellis codes and to compute tight estimates on the first-event-error probability and on the bit-error probability.
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most of popular clustering methods typically have some strong assumptions of the dataset. For example, the k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions might not be valid anymore. In order to overcome this weakness, we proposed a new clustering algorithm named localized ambient solidity separation (LASS) algorithm, using a new isolation criterion called centroid distance. Compared with other density based isolation criteria, our proposed centroid distance isolation criterion addresses the problem caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method to separate naturally isolated clusters but also can identify the clusters which are adjacent, overlapping, and under background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records that contains demographic and behaviors information. The results show that LASS algorithm works extremely well on this computer user dataset and can gain more knowledge from it. PMID:26221133
A novel algorithm and its VLSI architecture for connected component labeling
NASA Astrophysics Data System (ADS)
Zhao, Hualong; Sang, Hongshi; Zhang, Tianxu
2011-11-01
A novel line-based streaming labeling algorithm with its VLSI architecture is proposed in this paper. Line-based neighborhood examination scheme is used for efficient local connected components extraction. A novel reversed rooted tree hook-up strategy, which is very suitable for hardware implementation, is applied on the mergence stage of equivalent connected components. The reversed rooted tree hook-up strategy significant reduces the requirement of on-chip memory, which makes the chip area smaller. Clock domains crossing FIFOs are also applied for connecting the label core and external memory interface, which makes the label engine working in a higher frequency and raises the throughput of the label engine. Several performance tests have been performed for our proposed hardware implementation. The processing bandwidth of our hardware architecture can reach the I/O transfer boundary according to the external interface clock in all the real image tests. Beside the advantage of reducing the processing time, our hardware implementation can support the image size as large as 4096*4096, which will be very appealing in remote sensing or any other high-resolution image applications. The implementation of proposed architecture is synthesized with SMIC 180nm standard cell library. The work frequency of the label engine reaches 200MHz.
An Efficient Circulant MIMO Equalizer for CDMA Downlink: Algorithm and VLSI Architecture
NASA Astrophysics Data System (ADS)
Guo, Yuanbin; Zhang, Jianzhong(Charlie); McCain, Dennis; Cavallaro, Joseph R.
2006-12-01
We present an efficient circulant approximation-based MIMO equalizer architecture for the CDMA downlink. This reduces the direct matrix inverse (DMI) of size[InlineEquation not available: see fulltext.] with[InlineEquation not available: see fulltext.] complexity to some FFT operations with[InlineEquation not available: see fulltext.] complexity and the inverse of some[InlineEquation not available: see fulltext.] submatrices. We then propose parallel and pipelined VLSI architectures with Hermitian optimization and reduced-state FFT for further complexity optimization. Generic VLSI architectures are derived for the[InlineEquation not available: see fulltext.] high-order receiver from partitioned[InlineEquation not available: see fulltext.] submatrices. This leads to more parallel VLSI design with[InlineEquation not available: see fulltext.] further complexity reduction. Comparative study with both the conjugate-gradient and DMI algorithms shows very promising performance/complexity tradeoff. VLSI design space in terms of area/time efficiency is explored extensively for layered parallelism and pipelining with a Catapult C high-level-synthesis methodology.
Ehsan, Shoaib; Clark, Adrian F.; ur Rehman, Naveed; McDonald-Maier, Klaus D.
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
Ehsan, Shoaib; Clark, Adrian F; Naveed ur Rehman; McDonald-Maier, Klaus D
2015-01-01
The integral image, an intermediate image representation, has found extensive use in multi-scale local feature detection algorithms, such as Speeded-Up Robust Features (SURF), allowing fast computation of rectangular features at constant speed, independent of filter size. For resource-constrained real-time embedded vision systems, computation and storage of integral image presents several design challenges due to strict timing and hardware limitations. Although calculation of the integral image only consists of simple addition operations, the total number of operations is large owing to the generally large size of image data. Recursive equations allow substantial decrease in the number of operations but require calculation in a serial fashion. This paper presents two new hardware algorithms that are based on the decomposition of these recursive equations, allowing calculation of up to four integral image values in a row-parallel way without significantly increasing the number of operations. An efficient design strategy is also proposed for a parallel integral image computation unit to reduce the size of the required internal memory (nearly 35% for common HD video). Addressing the storage problem of integral image in embedded vision systems, the paper presents two algorithms which allow substantial decrease (at least 44.44%) in the memory requirements. Finally, the paper provides a case study that highlights the utility of the proposed architectures in embedded vision systems. PMID:26184211
An efficient FPGA architecture for integer ƞth root computation
NASA Astrophysics Data System (ADS)
Rangel-Valdez, Nelson; Barron-Zambrano, Jose Hugo; Torres-Huitzil, Cesar; Torres-Jimenez, Jose
2015-10-01
In embedded computing, it is common to find applications such as signal processing, image processing, computer graphics or data compression that might benefit from hardware implementation for the computation of integer roots of order ?. However, the scientific literature lacks architectural designs that implement such operations for different values of N, using a low amount of resources. This article presents a parameterisable field programmable gate array (FPGA) architecture for an efficient Nth root calculator that uses only adders/subtractors and ? location memory elements. The architecture was tested for different values of ?, using 64-bit number representation. The results show a consumption up to 10% of the logical resources of a Xilinx XC6SLX45-CSG324C device, depending on the value of N. The hardware implementation improved the performance of its corresponding software implementations in one order of magnitude. The architecture performance varies from several thousands to seven millions of root operations per second.
Matrix-based, finite-difference algorithms for computational acoustics
NASA Technical Reports Server (NTRS)
Davis, Sanford
1990-01-01
A compact numerical algorithm is introduced for simulating multidimensional acoustic waves. The algorithm is expressed in terms of a set of matrix coefficients on a three-point spatial grid that approximates the acoustic wave equation with a discretization error of O(h exp 5). The method is based on tracking a local phase variable and its implementation suggests a convenient coordinate splitting along with natural intermediate boundary conditions. Results are presented for oblique plane waves and compared with other procedures. Preliminary computations of acoustic diffraction are also considered.
Algorithm-based analysis of collective decoherence in quantum computation
NASA Astrophysics Data System (ADS)
Utsunomiya, Shoko; Master, Cyrus P.; Yamamoto, Yoshihisa
2007-02-01
In a quantum computer, qubits are often stored in identical two-level systems separated by a distance shorter than the characteristic wavelength of the reservoirs that are responsible for decoherence. In this case the collective qubit-reservoir interaction, rather than the individual qubit-reservoir interaction, may determine the decoherence properties. We study the collective decoherence behavior in between each step in certain quantum algorithms and propose a simple alternative of implementing quantum algorithms using a quantum trajectory that is close to a decoherence-free subspace that avoids unstable Dicke's superradiant states and Schrödinger's cat state.
Architectural issues in fault-tolerant, secure computing systems
Joseph, M.K.
1988-01-01
This dissertation explores several facets of the applicability of fault-tolerance techniques to secure computer design, these being: (1) how fault-tolerance techniques can be used on unsolved problems in computer security (e.g., computer viruses, and denial-of-service); (2) how fault-tolerance techniques can be used to support classical computer-security mechanisms in the presence of accidental and deliberate faults; and (3) the problems involved in designing a fault-tolerant, secure computer system (e.g., how computer security can degrade along with both the computational and fault-tolerance capabilities of a computer system). The approach taken in this research is almost as important as its results. It is different from current computer-security research in that a design paradigm for fault-tolerant computer design is used. This led to an extensive fault and error classification of many typical security threats. Throughout this work, a fault-tolerance perspective is taken. However, the author did not ignore basic computer-security technology. For some problems he investigated how to support and extend basic-security mechanism (e.g., trusted computing base), instead of trying to achieve the same result with purely fault-tolerance techniques.
Computer architecture providing high-performance and low-cost solutions for fast fMRI reconstruction
NASA Astrophysics Data System (ADS)
Chao, Hui; Goddard, J. Iain
1998-07-01
Due to the dynamic nature of brain studies in functional magnetic resonance imaging (fMRI), fast pulse sequences such as echo planar imaging (EPI) and spiral are often used for higher temporal resolution. Hundreds of frames of two- dimensional (2-D) images or multiple three-dimensional (3-D) images are often acquired to cover a larger space and time range. Therefore, fMRI often requires a much larger data storage, faster data transfer rate and higher processing power than conventional MRI. In Mercury Computer Systems' PCI-based embedded computer system, the computer architecture allows the concurrent use of a DMA engine for data transfer and CPU for data processing. This architecture allows a multicomputer to distribute processing and data with minimal time spent transferring data. Different types and numbers of processors are available to optimize system performance for the application. The fMRI reconstruction was first implemented in Mercury's PCI-based embedded computer system by using one digital signal processing (DSP) chip, with the host computer running under the Windows NTR platform. Double buffers in SRAM or cache were created for concurrent I/O and processing. The fMRI reconstruction was then implemented in parallel using multiple DSP chips. Data transfer and interprocessor synchronization were carefully managed to optimize algorithm efficiency. The image reconstruction times were measured with different numbers of processors ranging from one to 10. With one DSP chip, the timing for reconstructing 100 fMRI images measuring 128 X 64 pixels was 1.24 seconds, which is already faster than most existing commercial MRI systems. This PCI-based embedded multicomputer architecture, which has a nearly linear improvement in performance, provides high performance for fMRI processing. In summary, this embedded multicomputer system allows the choice of computer topologies to fit the specific application to achieve maximum system performance.
An Efficient VLSI Architecture of the Enhanced Three Step Search Algorithm
NASA Astrophysics Data System (ADS)
Biswas, Baishik; Mukherjee, Rohan; Saha, Priyabrata; Chakrabarti, Indrajit
2016-09-01
The intense computational complexity of any video codec is largely due to the motion estimation unit. The Enhanced Three Step Search is a popular technique that can be adopted for fast motion estimation. This paper proposes a novel VLSI architecture for the implementation of the Enhanced Three Step Search Technique. A new addressing mechanism has been introduced which enhances the speed of operation and reduces the area requirements. The proposed architecture when implemented in Verilog HDL on Virtex-5 Technology and synthesized using Xilinx ISE Design Suite 14.1 achieves a critical path delay of 4.8 ns while the area comes out to be 2.9K gate equivalent. It can be incorporated in commercial devices like smart-phones, camcorders, video conferencing systems etc.
Architectural and Algorithmic Requirements for a Next-Generation System Analysis Code
V.A. Mousseau
2010-05-01
This document presents high-level architectural and system requirements for a next-generation system analysis code (NGSAC) to support reactor safety decision-making by plant operators and others, especially in the context of light water reactor plant life extension. The capabilities of NGSAC will be different from those of current-generation codes, not only because computers have evolved significantly in the generations since the current paradigm was first implemented, but because the decision-making processes that need the support of next-generation codes are very different from the decision-making processes that drove the licensing and design of the current fleet of commercial nuclear power reactors. The implications of these newer decision-making processes for NGSAC requirements are discussed, and resulting top-level goals for the NGSAC are formulated. From these goals, the general architectural and system requirements for the NGSAC are derived.
Computational Discovery of Materials Using the Firefly Algorithm
NASA Astrophysics Data System (ADS)
Avendaño-Franco, Guillermo; Romero, Aldo
Our current ability to model physical phenomena accurately, the increase computational power and better algorithms are the driving forces behind the computational discovery and design of novel materials, allowing for virtual characterization before their realization in the laboratory. We present the implementation of a novel firefly algorithm, a population-based algorithm for global optimization for searching the structure/composition space. This novel computation-intensive approach naturally take advantage of concurrency, targeted exploration and still keeping enough diversity. We apply the new method in both periodic and non-periodic structures and we present the implementation challenges and solutions to improve efficiency. The implementation makes use of computational materials databases and network analysis to optimize the search and get insights about the geometric structure of local minima on the energy landscape. The method has been implemented in our software PyChemia, an open-source package for materials discovery. We acknowledge the support of DMREF-NSF 1434897 and the Donors of the American Chemical Society Petroleum Research Fund for partial support of this research under Contract 54075-ND10.
Malyj, W; Smith, R E; Horowitz, J M
1984-01-01
The new generation of microcomputers has brought computing power previously restricted to mainframe and supermini computers within the reach of individual scientific laboratories. Microcomputers can now provide computing speeds rivaling mainframes and computational accuracies exceeding those available in most computer centers. Inexpensive memory makes possible the transfer to microcomputers of software packages developed for mainframes and tested by years of experience. Combinations of high level languages and assembler subroutines permit the efficient design of specialized applications programs. Microprocessor architecture is approaching that of superminis, with coprocessors providing major contributions to computing power. The combined result of these developments is a major and perhaps revolutionary increase in the computing power now available to scientists.
Reference Architecture for High Dependability On-Board Computers
NASA Astrophysics Data System (ADS)
Silva, Nuno; Esper, Alexandre; Zandin, Johan; Barbosa, Ricardo; Monteleone, Claudio
2014-08-01
The industrial process in the area of on-board computers is characterized by small production series of on-board computers (hardware and software) configuration items with little recurrence at unit or set level (e.g. computer equipment unit, set of interconnected redundant units). These small production series result into a reduced amount of statistical data related to dependability, which influence on the way on-board computers are specified, designed and verified. In the context of ESA harmonization policy for the deployment of enhanced and homogeneous industrial processes in the area of avionics embedded systems and on-board computers for the space industry, this study aimed at rationalizing the initiation phase of the development or procurement of on-board computers and at improving dependability assurance. This aim was achieved by establishing generic requirements for the procurement or development of on-board computers with a focus on well-defined reliability, availability, and maintainability requirements, as well as a generic methodology for planning, predicting and assessing the dependability of on-board computers hardware and software throughout their life cycle. It also provides guidelines for producing evidence material and arguments to support dependability assurance of on-board computers hardware and software throughout the complete lifecycle, including an assessment of feasibility aspects of the dependability assurance process and how the use of computer-aided environment can contribute to the on-board computer dependability assurance.
Computational modeling of red blood cells: A symplectic integration algorithm
NASA Astrophysics Data System (ADS)
Schiller, Ulf D.; Ladd, Anthony J. C.
2010-03-01
Red blood cells can undergo shape transformations that impact the rheological properties of blood. Computational models have to account for the deformability and red blood cells are often modeled as elastically deformable objects. We present a symplectic integration algorithm for deformable objects. The surface is represented by a set of marker points obtained by surface triangulation, along with a set of fiber vectors that describe the orientation of the material plane. The various elastic energies are formulated in terms of these variables and the equations of motion are obtained by exact differentiation of a discretized Hamiltonian. The integration algorithm preserves the Hamiltonian structure and leads to highly accurate energy conservation, hence he method is expected to be more stable than conventional finite element methods. We apply the algorithm to simulate the shape dynamics of red blood cells.
State-Estimation Algorithm Based on Computer Vision
NASA Technical Reports Server (NTRS)
Bayard, David; Brugarolas, Paul
2007-01-01
An algorithm and software to implement the algorithm are being developed as means to estimate the state (that is, the position and velocity) of an autonomous vehicle, relative to a visible nearby target object, to provide guidance for maneuvering the vehicle. In the original intended application, the autonomous vehicle would be a spacecraft and the nearby object would be a small astronomical body (typically, a comet or asteroid) to be explored by the spacecraft. The algorithm could also be used on Earth in analogous applications -- for example, for guiding underwater robots near such objects of interest as sunken ships, mineral deposits, or submerged mines. It is assumed that the robot would be equipped with a vision system that would include one or more electronic cameras, image-digitizing circuitry, and an imagedata- processing computer that would generate feature-recognition data products.
A nonvoxel-based dose convolution/superposition algorithm optimized for scalable GPU architectures
Neylon, J. Sheng, K.; Yu, V.; Low, D. A.; Kupelian, P.; Santhanam, A.; Chen, Q.
2014-10-15
Purpose: Real-time adaptive planning and treatment has been infeasible due in part to its high computational complexity. There have been many recent efforts to utilize graphics processing units (GPUs) to accelerate the computational performance and dose accuracy in radiation therapy. Data structure and memory access patterns are the key GPU factors that determine the computational performance and accuracy. In this paper, the authors present a nonvoxel-based (NVB) approach to maximize computational and memory access efficiency and throughput on the GPU. Methods: The proposed algorithm employs a ray-tracing mechanism to restructure the 3D data sets computed from the CT anatomy into a nonvoxel-based framework. In a process that takes only a few milliseconds of computing time, the algorithm restructured the data sets by ray-tracing through precalculated CT volumes to realign the coordinate system along the convolution direction, as defined by zenithal and azimuthal angles. During the ray-tracing step, the data were resampled according to radial sampling and parallel ray-spacing parameters making the algorithm independent of the original CT resolution. The nonvoxel-based algorithm presented in this paper also demonstrated a trade-off in computational performance and dose accuracy for different coordinate system configurations. In order to find the best balance between the computed speedup and the accuracy, the authors employed an exhaustive parameter search on all sampling parameters that defined the coordinate system configuration: zenithal, azimuthal, and radial sampling of the convolution algorithm, as well as the parallel ray spacing during ray tracing. The angular sampling parameters were varied between 4 and 48 discrete angles, while both radial sampling and parallel ray spacing were varied from 0.5 to 10 mm. The gamma distribution analysis method (γ) was used to compare the dose distributions using 2% and 2 mm dose difference and distance-to-agreement criteria
High pressure humidification columns: Design equations, algorithm, and computer code
Enick, R.M.; Klara, S.M.; Marano, J.J.
1994-07-01
This report describes the detailed development of a computer model to simulate the humidification of an air stream in contact with a water stream in a countercurrent, packed tower, humidification column. The computer model has been developed as a user model for the Advanced System for Process Engineering (ASPEN) simulator. This was done to utilize the powerful ASPEN flash algorithms as well as to provide ease of use when using ASPEN to model systems containing humidification columns. The model can easily be modified for stand-alone use by incorporating any standard algorithm for performing flash calculations. The model was primarily developed to analyze Humid Air Turbine (HAT) power cycles; however, it can be used for any application that involves a humidifier or saturator. The solution is based on a multiple stage model of a packed column which incorporates mass and energy, balances, mass transfer and heat transfer rate expressions, the Lewis relation and a thermodynamic equilibrium model for the air-water system. The inlet air properties, inlet water properties and a measure of the mass transfer and heat transfer which occur in the column are the only required input parameters to the model. Several example problems are provided to illustrate the algorithm`s ability to generate the temperature of the water, flow rate of the water, temperature of the air, flow rate of the air and humidity of the air as a function of height in the column. The algorithm can be used to model any high-pressure air humidification column operating at pressures up to 50 atm. This discussion includes descriptions of various humidification processes, detailed derivations of the relevant expressions, and methods of incorporating these equations into a computer model for a humidification column.
Special-purpose computer for holography HORN-4 with recurrence algorithm
NASA Astrophysics Data System (ADS)
Shimobaba, Tomoyoshi; Hishinuma, Sinsuke; Ito, Tomoyoshi
2002-10-01
We designed and built a special-purpose computer for holography, HORN-4 (HOlographic ReconstructioN) using PLD (Programmable Logic Device) technology. HORN computers have a pipeline architecture. We use HORN-4 as an attached processor to enhance the performance of a general-purpose computer when it is used to generate holograms using a "recurrence formulas" algorithm developed by our previous paper. In the HORN-4 system, we designed the pipeline by adopting our "recurrence formulas" algorithm which can calculate the phase on a hologram. As the result, we could integrate the pipeline composed of 21 units into one PLD chip. The units in the pipeline consists of one BPU (Basic Phase Unit) unit and twenty CU (Cascade Unit) units. These CU units can compute twenty light intensities on a hologram plane at one time. By mounting two of the PLD chips on a PCI (Peripheral Component Interconnect) universal board, HORN-4 can calculate holograms at high speed of about 42 Gflops equivalent. The cost of HORN-4 board is about 1700 US dollar. We could obtain 800×600 grids hologram from a 3D-image composed of 415 points in about 0.45 sec with the HORN-4 system.
ERIC Educational Resources Information Center
Nikolic, B.; Radivojevic, Z.; Djordjevic, J.; Milutinovic, V.
2009-01-01
Courses in Computer Architecture and Organization are regularly included in Computer Engineering curricula. These courses are usually organized in such a way that students obtain not only a purely theoretical experience, but also a practical understanding of the topics lectured. This practical work is usually done in a laboratory using simulators…
ERIC Educational Resources Information Center
Stanley, Timothy D.; Wong, Lap Kei; Prigmore, Daniel; Benson, Justin; Fishler, Nathan; Fife, Leslie; Colton, Don
2007-01-01
Students learn better when they both hear and do. In computer architecture courses "doing" can be difficult in small schools without hardware laboratories hosted by computer engineering, electrical engineering, or similar departments. Software solutions exist. Our success with George Mills' Multimedia Logic (MML) is the focus of this paper. MML…
A Project-Based Learning Approach to Programmable Logic Design and Computer Architecture
ERIC Educational Resources Information Center
Kellett, C. M.
2012-01-01
This paper describes a course in programmable logic design and computer architecture as it is taught at the University of Newcastle, Australia. The course is designed around a major design project and has two supplemental assessment tasks that are also described. The context of the Computer Engineering degree program within which the course is…
Computation of synthetic mammograms with an edge-weighting algorithm
NASA Astrophysics Data System (ADS)
Homann, Hanno; Bergner, Frank; Erhard, Klaus
2015-03-01
The promising increase in cancer detection rates1, 2 makes digital breast tomosynthesis (DBT) an interesting alternative to full-field digital mammography (FFDM) in breast cancer screening. However, this benefit comes at the cost of an increased average glandular dose in a combined DBT plus FFDM acquisition protocol. Synthetic mammograms, which are computed from the reconstructed tomosynthesis volume data, have demonstrated to be an alternative to a regular FFDM exposure in a DBT plus synthetic 2D reading mode.3 Besides weighted averaging and modified maximum intensity projection (MIP) methods,4, 5 the integration of CAD techniques for computing a weighting function in the forward projection step of the synthetic mammogram generation has been recently proposed.6, 7 In this work, a novel and computationally efficient method is presented based on an edge-retaining algorithm, which directly computes the weighting function by an edge-detection filter.
Opportunities for X-ray Science in Future Computing Architectures
Foster, Ian
2011-02-09
The world of computing continues to evolve rapidly. In just the past 10 years, we have seen the emergence of petascale supercomputing, cloud computing that provides on-demand computing and storage with considerable economies of scale, software-as-a-service methods that permit outsourcing of complex processes, and grid computing that enables federation of resources across institutional boundaries. These trends show no sign of slowing down. The next 10 years will surely see exascale, new cloud offerings, and other terabit networks. This talk reviews various of these developments and discusses their potential implications for x-ray science and x-ray facilities.
A Moving Target Environment for Computer Configurations Using Genetic Algorithms
Crouse, Michael; Fulp, Errin W.
2011-10-31
Moving Target (MT) environments for computer systems provide security through diversity by changing various system properties that are explicitly defined in the computer configuration. Temporal diversity can be achieved by making periodic configuration changes; however in an infrastructure of multiple similarly purposed computers diversity must also be spatial, ensuring multiple computers do not simultaneously share the same configuration and potential vulnerabilities. Given the number of possible changes and their potential interdependencies discovering computer configurations that are secure, functional, and diverse is challenging. This paper describes how a Genetic Algorithm (GA) can be employed to find temporally and spatially diverse secure computer configurations. In the proposed approach a computer configuration is modeled as a chromosome, where an individual configuration setting is a trait or allele. The GA operates by combining multiple chromosomes (configurations) which are tested for feasibility and ranked based on performance which will be measured as resistance to attack. The result of successive iterations of the GA are secure configurations that are diverse due to the crossover and mutation processes. Simulations results will demonstrate this approach can provide at MT environment for a large infrastructure of similarly purposed computers by discovering temporally and spatially diverse secure configurations.
Sort-Mid tasks scheduling algorithm in grid computing.
Reda, Naglaa M; Tawfik, A; Marzok, Mohamed A; Khamis, Soheir M
2015-11-01
Scheduling tasks on heterogeneous resources distributed over a grid computing system is an NP-complete problem. The main aim for several researchers is to develop variant scheduling algorithms for achieving optimality, and they have shown a good performance for tasks scheduling regarding resources selection. However, using of the full power of resources is still a challenge. In this paper, a new heuristic algorithm called Sort-Mid is proposed. It aims to maximizing the utilization and minimizing the makespan. The new strategy of Sort-Mid algorithm is to find appropriate resources. The base step is to get the average value via sorting list of completion time of each task. Then, the maximum average is obtained. Finally, the task has the maximum average is allocated to the machine that has the minimum completion time. The allocated task is deleted and then, these steps are repeated until all tasks are allocated. Experimental tests show that the proposed algorithm outperforms almost other algorithms in terms of resources utilization and makespan. PMID:26644937
Sort-Mid tasks scheduling algorithm in grid computing.
Reda, Naglaa M; Tawfik, A; Marzok, Mohamed A; Khamis, Soheir M
2015-11-01
Scheduling tasks on heterogeneous resources distributed over a grid computing system is an NP-complete problem. The main aim for several researchers is to develop variant scheduling algorithms for achieving optimality, and they have shown a good performance for tasks scheduling regarding resources selection. However, using of the full power of resources is still a challenge. In this paper, a new heuristic algorithm called Sort-Mid is proposed. It aims to maximizing the utilization and minimizing the makespan. The new strategy of Sort-Mid algorithm is to find appropriate resources. The base step is to get the average value via sorting list of completion time of each task. Then, the maximum average is obtained. Finally, the task has the maximum average is allocated to the machine that has the minimum completion time. The allocated task is deleted and then, these steps are repeated until all tasks are allocated. Experimental tests show that the proposed algorithm outperforms almost other algorithms in terms of resources utilization and makespan.
Sort-Mid tasks scheduling algorithm in grid computing
Reda, Naglaa M.; Tawfik, A.; Marzok, Mohamed A.; Khamis, Soheir M.
2014-01-01
Scheduling tasks on heterogeneous resources distributed over a grid computing system is an NP-complete problem. The main aim for several researchers is to develop variant scheduling algorithms for achieving optimality, and they have shown a good performance for tasks scheduling regarding resources selection. However, using of the full power of resources is still a challenge. In this paper, a new heuristic algorithm called Sort-Mid is proposed. It aims to maximizing the utilization and minimizing the makespan. The new strategy of Sort-Mid algorithm is to find appropriate resources. The base step is to get the average value via sorting list of completion time of each task. Then, the maximum average is obtained. Finally, the task has the maximum average is allocated to the machine that has the minimum completion time. The allocated task is deleted and then, these steps are repeated until all tasks are allocated. Experimental tests show that the proposed algorithm outperforms almost other algorithms in terms of resources utilization and makespan. PMID:26644937
Orlando, Roberto; Delle Piane, Massimo; Bush, Ian J; Ugliengo, Piero; Ferrabone, Matteo; Dovesi, Roberto
2012-10-30
Fully ab initio treatment of complex solid systems needs computational software which is able to efficiently take advantage of the growing power of high performance computing (HPC) architectures. Recent improvements in CRYSTAL, a periodic ab initio code that uses a Gaussian basis set, allows treatment of very large unit cells for crystalline systems on HPC architectures with high parallel efficiency in terms of running time and memory requirements. The latter is a crucial point, due to the trend toward architectures relying on a very high number of cores with associated relatively low memory availability. An exhaustive performance analysis shows that density functional calculations, based on a hybrid functional, of low-symmetry systems containing up to 100,000 atomic orbitals and 8000 atoms are feasible on the most advanced HPC architectures available to European researchers today, using thousands of processors.
NASA Astrophysics Data System (ADS)
Gurcan, Metin N.; Chan, Heang-Ping; Sahiner, Berkman; Hadjiiski, Lubomir M.; Petrick, Nicholas; Helvie, Mark A.
2002-05-01
We evaluated the effectiveness of an optimal convolution neural network (CNN) architecture selected by simulated annealing for improving the performance of a computer-aided diagnosis (CAD) system designed for the detection of microcalcification clusters on digitized mammograms. The performances of the CAD programs with manually and optimally selected CNNs were compared using an independent test set. This set included 472 mammograms and contained 253 biopsy-proven malignant clusters. Free-response receiver operating characteristic (FROC) analysis was used for evaluation of the detection accuracy. At a false positive (FP) rate of 0.7 per image, the film-based sensitivity was 84.6% with the optimized CNN, in comparison with 77.2% with the manually selected CNN. If clusters having images in both craniocaudal and mediolateral oblique views were analyzed together and a cluster was considered to be detected when it was detected in one or both views, at 0.7 FPs/image, the sensitivity was 93.3% with the optimized CNN and 87.0% with the manually selected CNN. This study indicates that classification of true positive and FP signals is an important step of the CAD program and that the detection accuracy of the program can be considerably improved by optimizing this step with an automated optimization algorithm.
A learnable parallel processing architecture towards unity of memory and computing
NASA Astrophysics Data System (ADS)
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-08-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-08-14
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.
A learnable parallel processing architecture towards unity of memory and computing.
Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J
2015-01-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243
A learnable parallel processing architecture towards unity of memory and computing
Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.
2015-01-01
Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area. PMID:26271243
Distributed Computing Architecture for Image-Based Wavefront Sensing and 2 D FFTs
NASA Technical Reports Server (NTRS)
Smith, Jeffrey S.; Dean, Bruce H.; Haghani, Shadan
2006-01-01
Image-based wavefront sensing (WFS) provides significant advantages over interferometric-based wavefi-ont sensors such as optical design simplicity and stability. However, the image-based approach is computational intensive, and therefore, specialized high-performance computing architectures are required in applications utilizing the image-based approach. The development and testing of these high-performance computing architectures are essential to such missions as James Webb Space Telescope (JWST), Terrestial Planet Finder-Coronagraph (TPF-C and CorSpec), and Spherical Primary Optical Telescope (SPOT). The development of these specialized computing architectures require numerous two-dimensional Fourier Transforms, which necessitate an all-to-all communication when applied on a distributed computational architecture. Several solutions for distributed computing are presented with an emphasis on a 64 Node cluster of DSPs, multiple DSP FPGAs, and an application of low-diameter graph theory. Timing results and performance analysis will be presented. The solutions offered could be applied to other all-to-all communication and scientifically computationally complex problems.
Arranging computer architectures to create higher-performance controllers
NASA Technical Reports Server (NTRS)
Jacklin, Stephen A.
1988-01-01
Techniques for integrating microprocessors, array processors, and other intelligent devices in control systems are reviewed, with an emphasis on the (re)arrangement of components to form distributed or parallel processing systems. Consideration is given to the selection of the host microprocessor, increasing the power and/or memory capacity of the host, multitasking software for the host, array processors to reduce computation time, the allocation of real-time and non-real-time events to different computer subsystems, intelligent devices to share the computational burden for real-time events, and intelligent interfaces to increase communication speeds. The case of a helicopter vibration-suppression and stabilization controller is analyzed as an example, and significant improvements in computation and throughput rates are demonstrated.
An efficient parallel algorithm for accelerating computational protein design
Zhou, Yichao; Xu, Wei; Donald, Bruce R.; Zeng, Jianyang
2014-01-01
Motivation: Structure-based computational protein design (SCPR) is an important topic in protein engineering. Under the assumption of a rigid backbone and a finite set of discrete conformations of side-chains, various methods have been proposed to address this problem. A popular method is to combine the dead-end elimination (DEE) and A* tree search algorithms, which provably finds the global minimum energy conformation (GMEC) solution. Results: In this article, we improve the efficiency of computing A* heuristic functions for protein design and propose a variant of A* algorithm in which the search process can be performed on a single GPU in a massively parallel fashion. In addition, we make some efforts to address the memory exceeding problem in A* search. As a result, our enhancements can achieve a significant speedup of the A*-based protein design algorithm by four orders of magnitude on large-scale test data through pre-computation and parallelization, while still maintaining an acceptable memory overhead. We also show that our parallel A* search algorithm could be successfully combined with iMinDEE, a state-of-the-art DEE criterion, for rotamer pruning to further improve SCPR with the consideration of continuous side-chain flexibility. Availability: Our software is available and distributed open-source under the GNU Lesser General License Version 2.1 (GNU, February 1999). The source code can be downloaded from http://www.cs.duke.edu/donaldlab/osprey.php or http://iiis.tsinghua.edu.cn/∼compbio/software.html. Contact: zengjy321@tsinghua.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24931991
Region-Oriented Placement Algorithm for Coarse-Grained Power-Gating FPGA Architecture
NASA Astrophysics Data System (ADS)
Li, Ce; Dong, Yiping; Watanabe, Takahiro
An FPGA plays an essential role in industrial products due to its fast, stable and flexible features. But the power consumption of FPGAs used in portable devices is one of critical issues. Top-down hierarchical design method is commonly used in both ASIC and FPGA design. But, in the case where plural modules are integrated in an FPGA and some of them might be in sleep-mode, current FPGA architecture cannot be fully effective. In this paper, coarse-grained power gating FPGA architecture is proposed where a whole area of an FPGA is partitioned into several regions and power supply is controlled for each region, so that modules in sleep mode can be effectively power-off. We also propose a region oriented FPGA placement algorithm fitted to this user's hierarchical design based on VPR[1]. Simulation results show that this proposed method could reduce power consumption of FPGA by 38% on average by setting unused modules or regions in sleep mode.
Algorithms for computing and integrating physical maps using unique probes.
Jain, M; Myers, E W
1997-01-01
Current physical mapping projects based on STS-probes involve additional clues such as the fact that some probes are anchored to a known map and that others come from the ends of clones. Because of the disparate combinatorial contributions of these varied data items, it is difficult to design a "tailored" algorithm that incorporates them all. Moreover, it is inevitable that new experiments will provide new kinds of data, making obsolete any such algorithm. We show how to convert the physical mapping problem into a 0/1 linear programming (LP) problem. We further show how one can incorporate additional clues as additional constraints in the LP formulation. We give a simple relaxation of the 0/1 LP problem, which solves problems of the same scale as previously reported tailored algorithms, to equal or greater optimization levels. We also present a theorem proving that when the data is 100% accurate, then the relaxed and integer solutions coincide. The LP algorithm suffices to solve problems on the order of 80-100 probes--the typical size of the 2- or 3-connected contigs of Arratia et al. (1991). We give a heuristic algorithm which attempts to order and link the set of LP-solved contigs. Unlike previous work, this algorithm only links and orders contigs when the join is 90% or more likely to be correct. It is our view that there is no value in computing an optimal solution with respect to some criteria over very noisy data as this optimal solution rarely corresponds to the true solution. The paper involves extensive empirical trials over real and simulated data. PMID:9385539
Computer-Aided Design of Organic Host Architectures for Selective Chemosensors
Hay, Benjamin; Bryantsev, Vyacheslav S.
2009-01-01
Selective organic hosts provide the foundation for the development of many types of sensors. The deliberate design of host molecules with predetermined selectivity, however, remains a challenge in supramolecular chemistry. To address this issue we have developed a de novo structure-based design approach for the unbiased construction of complementary host architectures. This chapter summarizes recent progress including improvements on a computer software program, HostDesigner, specifically tailored to discover host architectures for small guest molecules. HostDesigner is capable of generating and evaluating millions of candidate structures in minutes on a desktop personal computer, allowing a user to rapidly identify three-dimensional architectures that are structurally organized for binding a targeted guest species. The efficacy of this computational methodology is illustrated with a search for cation hosts containing aliphatic ether oxygen groups and anion hosts containing urea groups.
Computer aided lung cancer diagnosis with deep learning algorithms
NASA Astrophysics Data System (ADS)
Sun, Wenqing; Zheng, Bin; Qian, Wei
2016-03-01
Deep learning is considered as a popular and powerful method in pattern recognition and classification. However, there are not many deep structured applications used in medical imaging diagnosis area, because large dataset is not always available for medical images. In this study we tested the feasibility of using deep learning algorithms for lung cancer diagnosis with the cases from Lung Image Database Consortium (LIDC) database. The nodules on each computed tomography (CT) slice were segmented according to marks provided by the radiologists. After down sampling and rotating we acquired 174412 samples with 52 by 52 pixel each and the corresponding truth files. Three deep learning algorithms were designed and implemented, including Convolutional Neural Network (CNN), Deep Belief Networks (DBNs), Stacked Denoising Autoencoder (SDAE). To compare the performance of deep learning algorithms with traditional computer aided diagnosis (CADx) system, we designed a scheme with 28 image features and support vector machine. The accuracies of CNN, DBNs, and SDAE are 0.7976, 0.8119, and 0.7929, respectively; the accuracy of our designed traditional CADx is 0.7940, which is slightly lower than CNN and DBNs. We also noticed that the mislabeled nodules using DBNs are 4% larger than using traditional CADx, this might be resulting from down sampling process lost some size information of the nodules.
Final Report: Super Instruction Architecture for Scalable Parallel Computations
Sanders, Beverly Ann; Bartlett, Rodney; Deumens, Erik
2013-12-23
The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.
A Survey of Architectural Techniques for Near-Threshold Computing
Mittal, Sparsh
2015-12-28
Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. In low-voltage computing and specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate and performance loss etc. Our paper surveys several re- cent techniques which aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlightmore » their similarities and differences. Ultimately, we hope that this paper will provide insights into state-of-art NTC techniques to researchers and system-designers and inspire further research in this field.« less
A Survey of Architectural Techniques for Near-Threshold Computing
Mittal, Sparsh
2015-12-28
Energy efficiency has now become the primary obstacle in scaling the performance of all classes of computing systems. In low-voltage computing and specifically, near-threshold voltage computing (NTC), which involves operating the transistor very close to and yet above its threshold voltage, holds the promise of providing many-fold improvement in energy efficiency. However, use of NTC also presents several challenges such as increased parametric variation, failure rate and performance loss etc. Our paper surveys several re- cent techniques which aim to offset these challenges for fully leveraging the potential of NTC. By classifying these techniques along several dimensions, we also highlight their similarities and differences. Ultimately, we hope that this paper will provide insights into state-of-art NTC techniques to researchers and system-designers and inspire further research in this field.
A language comparison for scientific computing on MIMD architectures
NASA Technical Reports Server (NTRS)
Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.
1989-01-01
Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.
A multitasking finite state architecture for computer control of an electric powertrain
Burba, J.C.
1984-01-01
Finite state techniques provide a common design language between the control engineer and the computer engineer for event driven computer control systems. They simplify communication and provide a highly maintainable control system understandable by both. This paper describes the development of a control system for an electric vehicle powertrain utilizing finite state concepts. The basics of finite state automata are provided as a framework to discuss a unique multitasking software architecture developed for this application. The architecture employs conventional time-sliced techniques with task scheduling controlled by a finite state machine representation of the control strategy of the powertrain. The complexities of excitation variable sampling in this environment are also considered.
A comparison of computer architectures for the NASA demonstration advanced avionics system
NASA Technical Reports Server (NTRS)
Seacord, C. L.; Bailey, D. G.; Larson, J. C.
1979-01-01
The paper compares computer architectures for the NASA demonstration advanced avionics system. Two computer architectures are described with an unusual approach to fault tolerance: a single spare processor can correct for faults in any of the distributed processors by taking on the role of a failed module. It was shown the system must be used from a functional point of view to properly apply redundancy and achieve fault tolerance and ultra reliability. Data are presented on complexity and mission failure probability which show that the revised version offers equivalent mission reliability at lower cost as measured by hardware and software complexity.
NASA Technical Reports Server (NTRS)
Saltz, J. H.; Naik, V. K.; Nicol, D. M.
1986-01-01
The efficient implementation of algorithms on multiprocessor machines requires that the effects of communication delays be minimized. The effects of these delays on the performance of a model problem on a hypercube multiprocessor architecture is investigated and methods are developed for increasing algorithm efficiency. The model problem under investigation is the solution by red-black Successive Over Relaxation YOUN71 of the heat equation; most of the techniques described here also apply equally well to the solution of elliptic partial differential equations by red-black or multicolor SOR methods. Methods for reducing communication traffic and overhead on a multiprocessor are identified and results of testing these methods on the Intel iPSC Hypercube reported. Methods for partitioning a problem's domain across processors, for reducing communication traffic during a global convergence check, for reducing the number of global convergence checks employed during an iteration, and for concurrently iterating on multiple time-steps in a time-dependent problem. Empirical results show that use of these models can markedly reduce a numewrical problem's execution time.
A Low Cost Microcomputer Laboratory for Investigating Computer Architecture.
ERIC Educational Resources Information Center
Mitchell, Eugene E., Ed.
1980-01-01
Described is a microcomputer laboratory at the United States Military Academy at West Point, New York, which provides easy access to non-volatile memory and a single input/output file system for 16 microcomputer laboratory positions. A microcomputer network that has a centralized data base is implemented using the concepts of computer network…
CSP: A Multifaceted Hybrid Architecture for Space Computing
NASA Technical Reports Server (NTRS)
Rudolph, Dylan; Wilson, Christopher; Stewart, Jacob; Gauvin, Patrick; George, Alan; Lam, Herman; Crum, Gary Alex; Wirthlin, Mike; Wilson, Alex; Stoddard, Aaron
2014-01-01
Research on the CHREC Space Processor (CSP) takes a multifaceted hybrid approach to embedded space computing. Working closely with the NASA Goddard SpaceCube team, researchers at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida and Brigham Young University are developing hybrid space computers that feature an innovative combination of three technologies: commercial-off-the-shelf (COTS) devices, radiation-hardened (RadHard) devices, and fault-tolerant computing. Modern COTS processors provide the utmost in performance and energy-efficiency but are susceptible to ionizing radiation in space, whereas RadHard processors are virtually immune to this radiation but are more expensive, larger, less energy-efficient, and generations behind in speed and functionality. By featuring COTS devices to perform the critical data processing, supported by simpler RadHard devices that monitor and manage the COTS devices, and augmented with novel uses of fault-tolerant hardware, software, information, and networking within and between COTS devices, the resulting system can maximize performance and reliability while minimizing energy consumption and cost. NASA Goddard has adopted the CSP concept and technology with plans underway to feature flight-ready CSP boards on two upcoming space missions.
An Architectural Design System Based on Computer Graphics.
ERIC Educational Resources Information Center
MacDonald, Stephen L.; Wehrli, Robert
The recent developments in computer hardware and software are presented to inform architects of this design tool. Technical advancements in equipment include--(1) cathode ray tube displays, (2) light pens, (3) print-out and photo copying attachments, (4) controls for comparison and selection of images, (5) chording keyboards, (6) plotters, and (7)…
Comparison of Two Computer Algorithms To Identify Surgical Site Infections
Apte, Mandar; Landers, Timothy; Furuya, Yoko; Hyman, Sandra
2011-01-01
Abstract Background Surgical site infections (SSIs), the second most common healthcare-associated infections, increase hospital stay and healthcare costs significantly. Traditional surveillance of SSIs is labor-intensive. Mandatory reporting and new non-payment policies for some SSIs increase the need for efficient and standardized surveillance methods. Computer algorithms using administrative, clinical, and laboratory data collected routinely have shown promise for complementing traditional surveillance. Methods Two computer algorithms were created to identify SSIs in inpatient admissions to an urban, academic tertiary-care hospital in 2007 using the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis codes (Rule A) and laboratory culture data (Rule B). We calculated the number of SSIs identified by each rule and both rules combined and the percent agreement between the rules. In a subset analysis, the results of the rules were compared with those of traditional surveillance in patients who had undergone coronary artery bypass graft surgery (CABG). Results Of the 28,956 index hospital admissions, 5,918 patients (20.4%) had at least one major surgical procedure. Among those and readmissions within 30 days, the ICD-9-CM-only rule identified 235 SSIs, the culture-only rule identified 287 SSIs; combined, the rules identified 426 SSIs, of which 96 were identified by both rules. Positive and negative agreement between the rules was 36.8% and 97.1%, respectively, with a kappa of 0.34 (95% confidence interval [CI] 0.27–0.41). In the subset analysis of patients who underwent CABG, of the 22 SSIs identified by traditional surveillance, Rule A identified 19 (86.4%) and Rule B identified 13 (59.1%) cases. Positive and negative agreement between Rules A and B within these “positive controls” was 81.3% and 50.0% with a kappa of 0.37 (95% CI 0.04–0.70). Conclusion Differences in the rates of SSI identified by computer
Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment
NASA Technical Reports Server (NTRS)
Duong, Tuan A.; Duong, Vu A.
2012-01-01
A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.
Conservative Grid-Interface Algorithm For Computing Flows
NASA Technical Reports Server (NTRS)
Klopfer, G. H.; Molvik, G. A.
1992-01-01
Best features of structured- and unstructured-grid methods combined. Gaps and overlaps between zonal grids eliminated by grid-interface algorithm, which generates single interfacial grid and corrects fluxes of flow quantities accordingly. Incorporated into two three-dimensional Navier-Stokes finite-volume codes and tested in computations of incompressible and compressible flows about simple bodies. Good numerical results obtained. General enough to be incorporated into other finite-volume codes without restrictions on complexities of shapes of bodies and zonal interfaces.
Efficient quantum algorithm for computing n-time correlation functions.
Pedernales, J S; Di Candia, R; Egusquiza, I L; Casanova, J; Solano, E
2014-07-11
We propose a method for computing n-time correlation functions of arbitrary spinorial, fermionic, and bosonic operators, consisting of an efficient quantum algorithm that encodes these correlations in an initially added ancillary qubit for probe and control tasks. For spinorial and fermionic systems, the reconstruction of arbitrary n-time correlation functions requires the measurement of two ancilla observables, while for bosonic variables time derivatives of the same observables are needed. Finally, we provide examples applicable to different quantum platforms in the frame of the linear response theory.
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
An Evaluation of Biosurveillance Grid—Dynamic Algorithm Distribution Across Multiple Computer Nodes
Tsai, Ming-Chi; Tsui, Fu-Chiang; Wagner, Michael M.
2007-01-01
Performing fast data analysis to detect disease outbreaks plays a critical role in real-time biosurveillance. In this paper, we described and evaluated an Algorithm Distribution Manager Service (ADMS) based on grid technologies, which dynamically partition and distribute detection algorithms across multiple computers. We compared the execution time to perform the analysis on a single computer and on a grid network (3 computing nodes) with and without using dynamic algorithm distribution. We found that algorithms with long runtime completed approximately three times earlier in distributed environment than in a single computer while short runtime algorithms performed worse in distributed environment. A dynamic algorithm distribution approach also performed better than static algorithm distribution approach. This pilot study shows a great potential to reduce lengthy analysis time through dynamic algorithm partitioning and parallel processing, and provides the opportunity of distributing algorithms from a client to remote computers in a grid network. PMID:18693936
How computer science can help in understanding the 3D genome architecture.
Shavit, Yoli; Merelli, Ivan; Milanesi, Luciano; Lio', Pietro
2016-09-01
Chromosome conformation capture techniques are producing a huge amount of data about the architecture of our genome. These data can provide us with a better understanding of the events that induce critical regulations of the cellular function from small changes in the three-dimensional genome architecture. Generating a unified view of spatial, temporal, genetic and epigenetic properties poses various challenges of data analysis, visualization, integration and mining, as well as of high performance computing and big data management. Here, we describe the critical issues of this new branch of bioinformatics, oriented at the comprehension of the three-dimensional genome architecture, which we call 'Nucleome Bioinformatics', looking beyond the currently available tools and methods, and highlight yet unaddressed challenges and the potential approaches that could be applied for tackling them. Our review provides a map for researchers interested in using computer science for studying 'Nucleome Bioinformatics', to achieve a better understanding of the biological processes that occur inside the nucleus.
Not Available
1994-09-01
This third volume of the Information Management Architecture for an Integrated Computing Environment for the Environmental Restoration Program--the Interim Technical Architecture (TA) (referred to throughout the remainder of this document as the ER TA)--represents a key milestone in establishing a coordinated information management environment in which information initiatives can be pursued with the confidence that redundancy and inconsistencies will be held to a minimum. This architecture is intended to be used as a reference by anyone whose responsibilities include the acquisition or development of information technology for use by the ER Program. The interim ER TA provides technical guidance at three levels. At the highest level, the technical architecture provides an overall computing philosophy or direction. At this level, the guidance does not address specific technologies or products but addresses more general concepts, such as the use of open systems, modular architectures, graphical user interfaces, and architecture-based development. At the next level, the technical architecture provides specific information technology recommendations regarding a wide variety of specific technologies. These technologies include computing hardware, operating systems, communications software, database management software, application development software, and personal productivity software, among others. These recommendations range from the adoption of specific industry or Martin Marietta Energy Systems, Inc. (Energy Systems) standards to the specification of individual products. At the third level, the architecture provides guidance regarding implementation strategies for the recommended technologies that can be applied to individual projects and to the ER Program as a whole.
Towards automatic Markov reliability modeling of computer architectures
NASA Technical Reports Server (NTRS)
Liceaga, C. A.; Siewiorek, D. P.
1986-01-01
The analysis and evaluation of reliability measures using time-varying Markov models is required for Processor-Memory-Switch (PMS) structures that have competing processes such as standby redundancy and repair, or renewal processes such as transient or intermittent faults. The task of generating these models is tedious and prone to human error due to the large number of states and transitions involved in any reasonable system. Therefore model formulation is a major analysis bottleneck, and model verification is a major validation problem. The general unfamiliarity of computer architects with Markov modeling techniques further increases the necessity of automating the model formulation. This paper presents an overview of the Automated Reliability Modeling (ARM) program, under development at NASA Langley Research Center. ARM will accept as input a description of the PMS interconnection graph, the behavior of the PMS components, the fault-tolerant strategies, and the operational requirements. The output of ARM will be the reliability of availability Markov model formulated for direct use by evaluation programs. The advantages of such an approach are (a) utility to a large class of users, not necessarily expert in reliability analysis, and (b) a lower probability of human error in the computation.
Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; Maris, Pieter; Vary, James P.
2015-07-14
In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important features of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.
Oryspayev, Dossay; Aktulga, Hasan Metin; Sosonkina, Masha; Maris, Pieter; Vary, James P.
2015-07-14
In this article, sparse matrix vector multiply (SpMVM) is an important kernel that frequently arises in high performance computing applications. Due to its low arithmetic intensity, several approaches have been proposed in literature to improve its scalability and efficiency in large scale computations. In this paper, our target systems are high end multi-core architectures and we use messaging passing interface + open multiprocessing hybrid programming model for parallelism. We analyze the performance of recently proposed implementation of the distributed symmetric SpMVM, originally developed for large sparse symmetric matrices arising in ab initio nuclear structure calculations. We also study important featuresmore » of this implementation and compare with previously reported implementations that do not exploit underlying symmetry. Our SpMVM implementations leverage the hybrid paradigm to efficiently overlap expensive communications with computations. Our main comparison criterion is the "CPU core hours" metric, which is the main measure of resource usage on supercomputers. We analyze the effects of topology-aware mapping heuristic using simplified network load model. Furthermore, we have tested the different SpMVM implementations on two large clusters with 3D Torus and Dragonfly topology. Our results show that the distributed SpMVM implementation that exploits matrix symmetry and hides communication yields the best value for the "CPU core hours" metric and significantly reduces data movement overheads.« less
Efficient computer algebra algorithms for polynomial matrices in control design
NASA Technical Reports Server (NTRS)
Baras, J. S.; Macenany, D. C.; Munach, R.
1989-01-01
The theory of polynomial matrices plays a key role in the design and analysis of multi-input multi-output control and communications systems using frequency domain methods. Examples include coprime factorizations of transfer functions, cannonical realizations from matrix fraction descriptions, and the transfer function design of feedback compensators. Typically, such problems abstract in a natural way to the need to solve systems of Diophantine equations or systems of linear equations over polynomials. These and other problems involving polynomial matrices can in turn be reduced to polynomial matrix triangularization procedures, a result which is not surprising given the importance of matrix triangularization techniques in numerical linear algebra. Matrices with entries from a field and Gaussian elimination play a fundamental role in understanding the triangularization process. In the case of polynomial matrices, matrices with entries from a ring for which Gaussian elimination is not defined and triangularization is accomplished by what is quite properly called Euclidean elimination. Unfortunately, the numerical stability and sensitivity issues which accompany floating point approaches to Euclidean elimination are not very well understood. New algorithms are presented which circumvent entirely such numerical issues through the use of exact, symbolic methods in computer algebra. The use of such error-free algorithms guarantees that the results are accurate to within the precision of the model data--the best that can be hoped for. Care must be taken in the design of such algorithms due to the phenomenon of intermediate expressions swell.
Initial condition for efficient mapping of level set algorithms on many-core architectures
NASA Astrophysics Data System (ADS)
Tornai, Gábor János; Cserey, György
2014-12-01
In this paper, we investigated the effect of adding more small curves to the initial condition which determines the required number of iterations of a fast level set (LS) evolution. As a result, we discovered two new theorems and developed a proof on the worst case of the required number of iterations. Furthermore, we found that these kinds of initial conditions fit well to many-core architectures. To show this, we have included two case studies which are presented on different platforms. One runs on a graphical processing unit (GPU) and the other is executed on a cellular nonlinear network universal machine (CNN-UM). With the new initial conditions, the steady-state solutions of the LS are reached in less than eight iterations depending on the granularity of the initial condition. These dense iterations can be calculated very quickly on many-core platforms according to the two case studies. In the case of the proposed dense initial condition on GPU, there is a significant speedup compared to the sparse initial condition in all cases since our dense initial condition together with the algorithm utilizes the properties of the underlying architecture. Therefore, greater performance gain can be achieved (up to 18 times speedup compared to the sparse initial condition on GPU). Additionally, we have validated our concept against numerically approximated LS evolution of standard flows (mean curvature, Chan-Vese, geodesic active regions). The dice indexes between the fast LS evolutions and the evolutions of the numerically approximated partial differential equations are in the range of 0.99±0.003.
NASA Astrophysics Data System (ADS)
Ullah, Muhammed Zafar
Neural Network and Fuzzy Logic are the two key technologies that have recently received growing attention in solving real world, nonlinear, time variant problems. Because of their learning and/or reasoning capabilities, these techniques do not need a mathematical model of the system, which may be difficult, if not impossible, to obtain for complex systems. One of the major problems in portable or electric vehicle world is secondary cell charging, which shows non-linear characteristics. Portable-electronic equipment, such as notebook computers, cordless and cellular telephones and cordless-electric lawn tools use batteries in increasing numbers. These consumers demand fast charging times, increased battery lifetime and fuel gauge capabilities. All of these demands require that the state-of-charge within a battery be known. Charging secondary cells Fast is a problem, which is difficult to solve using conventional techniques. Charge control is important in fast charging, preventing overcharging and improving battery life. This research work provides a quick and reliable approach to charger design using Neural-Fuzzy technology, which learns the exact battery charging characteristics. Neural-Fuzzy technology is an intelligent combination of neural net with fuzzy logic that learns system behavior by using system input-output data rather than mathematical modeling. The primary objective of this research is to improve the secondary cell charging algorithm and to have faster charging time based on neural network and fuzzy logic technique. Also a new architecture of a controller will be developed for implementing the charging algorithm for the secondary battery.
Usage of Thin-Client/Server Architecture in Computer Aided Education
ERIC Educational Resources Information Center
Cimen, Caghan; Kavurucu, Yusuf; Aydin, Halit
2014-01-01
With the advances of technology, thin-client/server architecture has become popular in multi-user/single network environments. Thin-client is a user terminal in which the user can login to a domain and run programs by connecting to a remote server. Recent developments in network and hardware technologies (cloud computing, virtualization, etc.)…
p88110: A Graphical Simulator for Computer Architecture and Organization Courses
ERIC Educational Resources Information Center
Garcia, M. I.; Rodriguez, S.; Perez, A.; Garcia, A.
2009-01-01
Studying fundamental Computer Architecture and Organization topics requires a significant amount of practical work if students are to acquire a good grasp of the theoretical concepts presented in classroom lectures or textbooks. The use of simulators is commonly adopted in order to reach this objective. However, as most of the available…
ERIC Educational Resources Information Center
Hung, Y.-C.
2012-01-01
This paper investigates the impact of combining self explaining (SE) with computer architecture diagrams to help novice students learn assembly language programming. Pre- and post-test scores for the experimental and control groups were compared and subjected to covariance (ANCOVA) statistical analysis. Results indicate that the SE-plus-diagram…
Computer models in room acoustics: The ray tracing method and the auralization algorithms
NASA Astrophysics Data System (ADS)
Pompei, Anna; Sumbatyan, M. A.; Todorov, N. F.
2009-11-01
Computer algorithms are described for constructing virtual acoustic models of various rooms that should satisfy some specific sound quality criteria. The algorithms are based on the ray tracing method, which, in the general case, allows calculation of the amplitude of an acoustic ray that survived multiple reflections from arbitrary curved surfaces. As a result, calculations of room acoustics are reduced to tracing the trajectories of all the acoustic rays in the course of their propagation with multiple reflections from reflecting surfaces to the point of their complete decay. For this approach to be used, the following physical properties of a room should be known: the geometry of the reflecting surfaces, the absorption and diffusion coefficients on each of these surfaces, and the decay law for rays propagating in air. The proposed models allow for the solution of the important problem of architectural acoustics called the auralization problem, i.e., to predict how any given audio segment will sound in any given hall on the basis of computer simulation alone, without any full-scale testing in specific halls.
A Fast Full Tensor Gravity computation algorithm for High Resolution 3D Geologic Interpretations
NASA Astrophysics Data System (ADS)
Jayaram, V.; Crain, K.; Keller, G. R.
2011-12-01
We present an algorithm to rapidly calculate the vertical gravity and full tensor gravity (FTG) values due to a 3-D geologic model. This algorithm can be implemented on single, multi-core CPU and graphical processing units (GPU) architectures. Our technique is based on the line element approximation with a constant density within each grid cell. This type of parameterization is well suited for high-resolution elevation datasets with grid size typically in the range of 1m to 30m. The large high-resolution data grids in our studies employ a pre-filtered mipmap pyramid type representation for the grid data known as the Geometry clipmap. The clipmap was first introduced by Microsoft Research in 2004 to do fly-through terrain visualization. This method caches nested rectangular extents of down-sampled data layers in the pyramid to create view-dependent calculation scheme. Together with the simple grid structure, this allows the gravity to be computed conveniently on-the-fly, or stored in a highly compressed format. Neither of these capabilities has previously been available. Our approach can perform rapid calculations on large topographies including crustal-scale models derived from complex geologic interpretations. For example, we used a 1KM Sphere model consisting of 105000 cells at 10m resolution with 100000 gravity stations. The line element approach took less than 90 seconds to compute the FTG and vertical gravity on an Intel Core i7 CPU at 3.07 GHz utilizing just its single core. Also, unlike traditional gravity computational algorithms, the line-element approach can calculate gravity effects at locations interior or exterior to the model. The only condition that must be met is the observation point cannot be located directly above the line element. Therefore, we perform a location test and then apply appropriate formulation to those data points. We will present and compare the computational performance of the traditional prism method versus the line element
WISDOM: A prototype office implementation of the SRS computing architecture
Eckert, D.W.
1991-12-31
The Savannah River Site has historically allowed the purchase of IBM MS-DOS and Apple Macintosh computers based on user request. As workgroup file services are implemented on the Local Area Network users desire to share data to a greater extent. This has resulted in mixed groups who now wish to share data files cleanly among dissimilar operating systems. WISDOM was designed as a system of network services, workstation platform standards, installation procedures, and application choices which would address data integration from the user perspective. Novell Netware provides a basis for file transfer, while Microsoft Windows supplies the GUI necessary to compliment the Macintosh. Central administration, networking protocols, host connectivity, and memory management restrictions required imaginative solutions. This paper describes the current status of the 500-workstation prototype; user acceptance and training; and outstanding issues to be addressed. Details are given on the design philosophy, some of the technology utilized, the implementation process, and future directions.
WISDOM: A prototype office implementation of the SRS computing architecture
Eckert, D.W.
1991-01-01
The Savannah River Site has historically allowed the purchase of IBM MS-DOS and Apple Macintosh computers based on user request. As workgroup file services are implemented on the Local Area Network users desire to share data to a greater extent. This has resulted in mixed groups who now wish to share data files cleanly among dissimilar operating systems. WISDOM was designed as a system of network services, workstation platform standards, installation procedures, and application choices which would address data integration from the user perspective. Novell Netware provides a basis for file transfer, while Microsoft Windows supplies the GUI necessary to compliment the Macintosh. Central administration, networking protocols, host connectivity, and memory management restrictions required imaginative solutions. This paper describes the current status of the 500-workstation prototype; user acceptance and training; and outstanding issues to be addressed. Details are given on the design philosophy, some of the technology utilized, the implementation process, and future directions.
An Object-Oriented Network-Centric Software Architecture for Physical Computing
NASA Astrophysics Data System (ADS)
Palmer, Richard
1997-08-01
Recent developments in object-oriented computer languages and infrastructure such as the Internet, Web browsers, and the like provide an opportunity to define a more productive computational environment for scientific programming that is based more closely on the underlying mathematics describing physics than traditional programming languages such as FORTRAN or C++. In this talk I describe an object-oriented software architecture for representing physical problems that includes classes for such common mathematical objects as geometry, boundary conditions, partial differential and integral equations, discretization and numerical solution methods, etc. In practice, a scientific program written using this architecture looks remarkably like the mathematics used to understand the problem, is typically an order of magnitude smaller than traditional FORTRAN or C++ codes, and hence easier to understand, debug, describe, etc. All objects in this architecture are ``network-enabled,'' which means that components of a software solution to a physical problem can be transparently loaded from anywhere on the Internet or other global network. The architecture is expressed as an ``API,'' or application programmers interface specification, with reference embeddings in Java, Python, and C++. A C++ class library for an early version of this API has been implemented for machines ranging from PC's to the IBM SP2, meaning that phidentical codes run on all architectures.
NASA Technical Reports Server (NTRS)
Forman, P.; Moses, K.
1979-01-01
A brief description of a SIFT (Software Implemented Fault Tolerance) Flight Control Computer with emphasis on implementation is presented. A multiprocessor system that relies on software-implemented fault detection and reconfiguration algorithms is described. A high level reliability and fault tolerance is achieved by the replication of computing tasks among processing units.
Adaptation of the anelastic solver EULAG to high performance computing architectures.
NASA Astrophysics Data System (ADS)
Wójcik, Damian; Ciżnicki, Miłosz; Kopta, Piotr; Kulczewski, Michał; Kurowski, Krzysztof; Piotrowski, Zbigniew; Rojek, Krzysztof; Rosa, Bogdan; Szustak, Łukasz; Wyrzykowski, Roman
2014-05-01
In recent years there has been widespread interest in employing heterogeneous and hybrid supercomputing architectures for geophysical research. Especially promising application for the modern supercomputing architectures is the numerical weather prediction (NWP). Adopting traditional NWP codes to the new machines based on multi- and many-core processors, such as GPUs allows to increase computational efficiency and decrease energy consumption. This offers unique opportunity to develop simulations with finer grid resolutions and computational domains larger than ever before. Further, it enables to extend the range of scales represented in the model so that the accuracy of representation of the simulated atmospheric processes can be improved. Consequently, it allows to improve quality of weather forecasts. Coalition of Polish scientific institutions launched a project aimed at adopting EULAG fluid solver for future high-performance computing platforms. EULAG is currently being implemented as a new dynamical core of COSMO Consortium weather prediction framework. The solver code combines features of a stencil and point wise computations. Its communication scheme consists of both halo exchange subroutines and global reduction functions. Within the project, two main modules of EULAG, namely MPDATA advection and iterative GCR elliptic solver are analyzed and optimized. Relevant techniques have been chosen and applied to accelerate code execution on modern HPC architectures: stencil decomposition, block decomposition (with weighting analysis between computation and communication), reduction of inter-cache communication by partitioning of cores into independent teams, cache reusing and vectorization. Experiments with matching computational domain topology to cluster topology are performed as well. The parallel formulation was extended from pure MPI to hybrid MPI - OpenMP approach. Porting to GPU using CUDA directives is in progress. Preliminary results of performance of the
Cloud identification using genetic algorithms and massively parallel computation
NASA Technical Reports Server (NTRS)
Buckles, Bill P.; Petry, Frederick E.
1996-01-01
As a Guest Computational Investigator under the NASA administered component of the High Performance Computing and Communication Program, we implemented a massively parallel genetic algorithm on the MasPar SIMD computer. Experiments were conducted using Earth Science data in the domains of meteorology and oceanography. Results obtained in these domains are competitive with, and in most cases better than, similar problems solved using other methods. In the meteorological domain, we chose to identify clouds using AVHRR spectral data. Four cloud speciations were used although most researchers settle for three. Results were remarkedly consistent across all tests (91% accuracy). Refinements of this method may lead to more timely and complete information for Global Circulation Models (GCMS) that are prevalent in weather forecasting and global environment studies. In the oceanographic domain, we chose to identify ocean currents from a spectrometer having similar characteristics to AVHRR. Here the results were mixed (60% to 80% accuracy). Given that one is willing to run the experiment several times (say 10), then it is acceptable to claim the higher accuracy rating. This problem has never been successfully automated. Therefore, these results are encouraging even though less impressive than the cloud experiment. Successful conclusion of an automated ocean current detection system would impact coastal fishing, naval tactics, and the study of micro-climates. Finally we contributed to the basic knowledge of GA (genetic algorithm) behavior in parallel environments. We developed better knowledge of the use of subpopulations in the context of shared breeding pools and the migration of individuals. Rigorous experiments were conducted based on quantifiable performance criteria. While much of the work confirmed current wisdom, for the first time we were able to submit conclusive evidence. The software developed under this grant was placed in the public domain. An extensive user
Synchronized computational architecture for generalized bilateral control of robot arms
NASA Technical Reports Server (NTRS)
Szakaly, Zoltan F. (Inventor)
1991-01-01
A master six degree of freedom Force Reflecting Hand Controller (FRHC) is available at a master site where a received image displays, in essentially real time, a remote robotic manipulator which is being controlled in the corresponding six degree freedom by command signals which are transmitted to the remote site in accordance with the movement of the FRHC at the master site. Software is user-initiated at the master site in order to establish the basic system conditions, and then a physical movement of the FRHC in Cartesean space is reflected at the master site by six absolute numbers that are sensed, translated and computed as a difference signal relative to the earlier position. The change in position is then transmitted in that differential signal form over a high speed synchronized bilateral communication channel which simultaneously returns robot-sensed response information to the master site as forces applied to the FRHC so that the FRHC reflects the feel of what is taking place at the remote site. A system wide clock rate is selected at a sufficiently high rate that the operator at the master site experiences the Force Reflecting operation in real time.
A Cerebellar Neuroprosthetic System: Computational Architecture and in vivo Test
Herreros, Ivan; Giovannucci, Andrea; Taub, Aryeh H.; Hogri, Roni; Magal, Ari; Bamford, Sim; Prueckl, Robert; Verschure, Paul F. M. J.
2014-01-01
Emulating the input–output functions performed by a brain structure opens the possibility for developing neuroprosthetic systems that replace damaged neuronal circuits. Here, we demonstrate the feasibility of this approach by replacing the cerebellar circuit responsible for the acquisition and extinction of motor memories. Specifically, we show that a rat can undergo acquisition, retention, and extinction of the eye-blink reflex even though the biological circuit responsible for this task has been chemically inactivated via anesthesia. This is achieved by first developing a computational model of the cerebellar microcircuit involved in the acquisition of conditioned reflexes and training it with synthetic data generated based on physiological recordings. Secondly, the cerebellar model is interfaced with the brain of an anesthetized rat, connecting the model’s inputs and outputs to afferent and efferent cerebellar structures. As a result, we show that the anesthetized rat, equipped with our neuroprosthetic system, can be classically conditioned to the acquisition of an eye-blink response. However, non-stationarities in the recorded biological signals limit the performance of the cerebellar model. Thus, we introduce an updated cerebellar model and validate it with physiological recordings showing that learning becomes stable and reliable. The resulting system represents an important step toward replacing lost functions of the central nervous system via neuroprosthetics, obtained by integrating a synthetic circuit with the afferent and efferent pathways of a damaged brain region. These results also embody an early example of science-based medicine, where on the one hand the neuroprosthetic system directly validates a theory of cerebellar learning that informed the design of the system, and on the other one it takes a step toward the development of neuro-prostheses that could recover lost learning functions in animals and, in the longer term, humans. PMID:25152887
Design of a fault tolerant airborne digital computer. Volume 1: Architecture
NASA Technical Reports Server (NTRS)
Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.
1973-01-01
This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
CRADA ORNL 91-0046B final report: Assessment of IBM advanced computing architectures
Geist, G.A.
1996-02-01
This was a Cooperative Research and Development Agreement (CRADA) with IBM to assess their advanced computer architectures. Over the course of this project three different architectures were evaluated. The POWER/4 RIOS1 based shared memory multiprocessor, the POWER/2 RIOS2 based high performance workstation, and the J30 PowerPC based shared memory multiprocessor. In addition to this hardware several software packages where beta tested for IBM including: ESSO scientific computing library, nv video-conferencing package, Ultimedia multimedia display environment, FORTRAN 90 and C++ compilers, and the AIX 4.1 operating system. Both IBM and ORNL benefited from the research performed in this project and even though access to the POWER/4 computer was delayed several months, all milestones were met.
Scalable quantum computer architecture with coupled donor-quantum dot qubits
Schenkel, Thomas; Lo, Cheuk Chi; Weis, Christoph; Lyon, Stephen; Tyryshkin, Alexei; Bokor, Jeffrey
2014-08-26
A quantum bit computing architecture includes a plurality of single spin memory donor atoms embedded in a semiconductor layer, a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, wherein a first voltage applied across at least one pair of the aligned quantum dot and donor atom controls a donor-quantum dot coupling. A method of performing quantum computing in a scalable architecture quantum computing apparatus includes arranging a pattern of single spin memory donor atoms in a semiconductor layer, forming a plurality of quantum dots arranged with the semiconductor layer and aligned with the donor atoms, applying a first voltage across at least one aligned pair of a quantum dot and donor atom to control a donor-quantum dot coupling, and applying a second voltage between one or more quantum dots to control a Heisenberg exchange J coupling between quantum dots and to cause transport of a single spin polarized electron between quantum dots.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Greedy replica exchange algorithm for heterogeneous computing grids.
Lockhart, Christopher; O'Connor, James; Armentrout, Steven; Klimov, Dmitri K
2015-09-01
Replica exchange molecular dynamics (REMD) has become a valuable tool in studying complex biomolecular systems. However, its application on distributed computing grids is limited by the heterogeneity of this environment. In this study, we propose a REMD implementation referred to as greedy REMD (gREMD) suitable for computations on heterogeneous grids. To decentralize replica management, gREMD utilizes a precomputed schedule of exchange attempts between temperatures. Our comparison of gREMD against standard REMD suggests four main conclusions. First, gREMD accelerates grid REMD simulations by as much as 40 %. Second, gREMD increases CPU utilization rates in grid REMD by up to 60 %. Third, we argue that gREMD is expected to maintain approximately constant CPU utilization rates and simulation wall-clock times with the increase in the number of replicas. Finally, we show that gREMD correctly implements the REMD algorithm and reproduces the conformational ensemble of a short peptide sampled in our previous standard REMD simulations. We believe that gREMD can find its place in large-scale REMD simulations on heterogeneous computing grids. PMID:26311229
Greedy replica exchange algorithm for heterogeneous computing grids.
Lockhart, Christopher; O'Connor, James; Armentrout, Steven; Klimov, Dmitri K
2015-09-01
Replica exchange molecular dynamics (REMD) has become a valuable tool in studying complex biomolecular systems. However, its application on distributed computing grids is limited by the heterogeneity of this environment. In this study, we propose a REMD implementation referred to as greedy REMD (gREMD) suitable for computations on heterogeneous grids. To decentralize replica management, gREMD utilizes a precomputed schedule of exchange attempts between temperatures. Our comparison of gREMD against standard REMD suggests four main conclusions. First, gREMD accelerates grid REMD simulations by as much as 40 %. Second, gREMD increases CPU utilization rates in grid REMD by up to 60 %. Third, we argue that gREMD is expected to maintain approximately constant CPU utilization rates and simulation wall-clock times with the increase in the number of replicas. Finally, we show that gREMD correctly implements the REMD algorithm and reproduces the conformational ensemble of a short peptide sampled in our previous standard REMD simulations. We believe that gREMD can find its place in large-scale REMD simulations on heterogeneous computing grids.
Missing wedge computed tomography by iterative algorithm DIRECTT.
Kupsch, Andreas; Lange, Axel; Hentschel, Manfred P; Lück, Sebastian; Schmidt, Volker; Grothausmann, Roman; Hilger, André; Manke, Ingo
2015-01-01
A strategy to mitigate typical reconstruction artefacts in missing wedge computed tomography is presented. These artefacts appear as elongations of reconstructed details along the mean direction (i.e. the symmetry centre of the projections). Although absent in standard computed tomography applications, they are most prominent in advanced electron tomography and also in special topics of X-ray and neutron tomography under restricted geometric boundary conditions. We investigate the performance of the DIRECTT (Direct Iterative Reconstruction of Computed Tomography Trajectories) algorithm to reduce the directional artefacts in standard procedures. In order to be sensitive to the anisotropic nature of missing wedge artefacts, we investigate isotropic substructures of metal foam as well as circular disc models. Comparison is drawn to filtered backprojection and algebraic techniques. Reference is made to reconstructions of complete data sets. For the purpose of assessing the reconstruction quality, Fourier transforms are employed to visualize the missing wedge directly. Deficient reconstructions of disc models are evaluated by a length-weighted kernel density estimation, which yields the probabilities of boundary orientations. The DIRECTT results are assessed at different signal-to-noise ratios by means of local and integral evaluation parameters. PMID:26367127
Advanced entry guidance algorithm with landing footprint computation
NASA Astrophysics Data System (ADS)
Leavitt, James Aaron
The design and performance evaluation of an entry guidance algorithm for future space transportation vehicles is presented. The algorithm performs two functions: on-board trajectory planning and trajectory tracking. The planned longitudinal path is followed by tracking drag acceleration, as is done by the Space Shuttle entry guidance. Unlike the Shuttle entry guidance, lateral path curvature is also planned and followed. A new trajectory planning function for the guidance algorithm is developed that is suitable for suborbital entry and that significantly enhances the overall performance of the algorithm for both orbital and suborbital entry. In comparison with the previous trajectory planner, the new planner produces trajectories that are easier to track, especially near the upper and lower drag boundaries and for suborbital entry. The new planner accomplishes this by matching the vehicle's initial flight path angle and bank angle, and by enforcing the full three-degree-of-freedom equations of motion with control derivative limits. Insights gained from trajectory optimization results contribute to the design of the new planner, giving it near-optimal downrange and crossrange capabilities. Planned trajectories and guidance simulation results are presented that demonstrate the improved performance. Based on the new planner, a method is developed for approximating the landing footprint for entry vehicles in near real-time, as would be needed for an on-board flight management system. The boundary of the footprint is constructed from the endpoints of extreme downrange and crossrange trajectories generated by the new trajectory planner. The footprint algorithm inherently possesses many of the qualities of the new planner, including quick execution, the ability to accurately approximate the vehicle's glide capabilities, and applicability to a wide range of entry conditions. Footprints can be generated for orbital and suborbital entry conditions using a pre
NASA Astrophysics Data System (ADS)
Liu, Lei; Hong, Xiaobin; Wu, Jian; Lin, Jintong
As Grid computing continues to gain popularity in the industry and research community, it also attracts more attention from the customer level. The large number of users and high frequency of job requests in the consumer market make it challenging. Clearly, all the current Client/Server(C/S)-based architecture will become unfeasible for supporting large-scale Grid applications due to its poor scalability and poor fault-tolerance. In this paper, based on our previous works [1, 2], a novel self-organized architecture to realize a highly scalable and flexible platform for Grids is proposed. Experimental results show that this architecture is suitable and efficient for consumer-oriented Grids.
Paralel Multiphysics Algorithms and Software for Computational Nuclear Engineering
D. Gaston; G. Hansen; S. Kadioglu; D. A. Knoll; C. Newman; H. Park; C. Permann; W. Taitano
2009-08-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple 'code coupling' or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.
Parallel multiphysics algorithms and software for computational nuclear engineering
NASA Astrophysics Data System (ADS)
Gaston, D.; Hansen, G.; Kadioglu, S.; Knoll, D. A.; Newman, C.; Park, H.; Permann, C.; Taitano, W.
2009-07-01
There is a growing trend in nuclear reactor simulation to consider multiphysics problems. This can be seen in reactor analysis where analysts are interested in coupled flow, heat transfer and neutronics, and in fuel performance simulation where analysts are interested in thermomechanics with contact coupled to species transport and chemistry. These more ambitious simulations usually motivate some level of parallel computing. Many of the coupling efforts to date utilize simple code coupling or first-order operator splitting, often referred to as loose coupling. While these approaches can produce answers, they usually leave questions of accuracy and stability unanswered. Additionally, the different physics often reside on separate grids which are coupled via simple interpolation, again leaving open questions of stability and accuracy. Utilizing state of the art mathematics and software development techniques we are deploying next generation tools for nuclear engineering applications. The Jacobian-free Newton-Krylov (JFNK) method combined with physics-based preconditioning provide the underlying mathematical structure for our tools. JFNK is understood to be a modern multiphysics algorithm, but we are also utilizing its unique properties as a scale bridging algorithm. To facilitate rapid development of multiphysics applications we have developed the Multiphysics Object-Oriented Simulation Environment (MOOSE). Examples from two MOOSE-based applications: PRONGHORN, our multiphysics gas cooled reactor simulation tool and BISON, our multiphysics, multiscale fuel performance simulation tool will be presented.
CPU SIM: A Computer Simulator for Use in an Introductory Computer Organization-Architecture Class.
ERIC Educational Resources Information Center
Skrein, Dale
1994-01-01
CPU SIM, an interactive low-level computer simulation package that runs on the Macintosh computer, is described. The program is designed for instructional use in the first or second year of undergraduate computer science, to teach various features of typical computer organization through hands-on exercises. (MSE)
ICASE Computer Science Program
NASA Technical Reports Server (NTRS)
1985-01-01
The Institute for Computer Applications in Science and Engineering computer science program is discussed in outline form. Information is given on such topics as problem decomposition, algorithm development, programming languages, and parallel architectures.
Delta: An object-oriented finite element code architecture for massively parallel computers
Weatherby, J.R.; Schutt, J.A.; Peery, J.S.; Hogan, R.E.
1996-02-01
Delta is an object-oriented code architecture based on the finite element method which enables simulation of a wide range of engineering mechanics problems in a parallel processing environment. Written in C{sup ++}, Delta is a natural framework for algorithm development and for research involving coupling of mechanics from different Engineering Science disciplines. To enhance flexibility and encourage code reuse, the architecture provides a clean separation of the major aspects of finite element programming. Spatial discretization, temporal discretization, and the solution of linear and nonlinear systems of equations are each implemented separately, independent from the governing field equations. Other attractive features of the Delta architecture include support for constitutive models with internal variables, reusable ``matrix-free`` equation solvers, and support for region-to-region variations in the governing equations and the active degrees of freedom. A demonstration code built from the Delta architecture has been used in two-dimensional and three-dimensional simulations involving dynamic and quasi-static solid mechanics, transient and steady heat transport, and flow in porous media.
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing.
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2015-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit the device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital(reconfigurable) and subthreshold mixed-signal realizations. Both Electroencephalogram (EEG) and Electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90 and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively. PMID:26869876
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing.
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2015-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit the device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital(reconfigurable) and subthreshold mixed-signal realizations. Both Electroencephalogram (EEG) and Electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90 and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively.
Design and Analysis of a Neuromemristive Reservoir Computing Architecture for Biosignal Processing
Kudithipudi, Dhireesha; Saleh, Qutaiba; Merkel, Cory; Thesing, James; Wysocki, Bryant
2016-01-01
Reservoir computing (RC) is gaining traction in several signal processing domains, owing to its non-linear stateful computation, spatiotemporal encoding, and reduced training complexity over recurrent neural networks (RNNs). Previous studies have shown the effectiveness of software-based RCs for a wide spectrum of applications. A parallel body of work indicates that realizing RNN architectures using custom integrated circuits and reconfigurable hardware platforms yields significant improvements in power and latency. In this research, we propose a neuromemristive RC architecture, with doubly twisted toroidal structure, that is validated for biosignal processing applications. We exploit the device mismatch to implement the random weight distributions within the reservoir and propose mixed-signal subthreshold circuits for energy efficiency. A comprehensive analysis is performed to compare the efficiency of the neuromemristive RC architecture in both digital(reconfigurable) and subthreshold mixed-signal realizations. Both Electroencephalogram (EEG) and Electromyogram (EMG) biosignal benchmarks are used for validating the RC designs. The proposed RC architecture demonstrated an accuracy of 90 and 84% for epileptic seizure detection and EMG prosthetic finger control, respectively. PMID:26869876
A class of parallel algorithms for computation of the manipulator inertia matrix
NASA Technical Reports Server (NTRS)
Fijany, Amir; Bejczy, Antal K.
1989-01-01
Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.
Service-Oriented Architecture for NVO and TeraGrid Computing
NASA Technical Reports Server (NTRS)
Jacob, Joseph; Miller, Craig; Williams, Roy; Steenberg, Conrad; Graham, Matthew
2008-01-01
The National Virtual Observatory (NVO) Extensible Secure Scalable Service Infrastructure (NESSSI) is a Web service architecture and software framework that enables Web-based astronomical data publishing and processing on grid computers such as the National Science Foundation's TeraGrid. Characteristics of this architecture include the following: (1) Services are created, managed, and upgraded by their developers, who are trusted users of computing platforms on which the services are deployed. (2) Service jobs can be initiated by means of Java or Python client programs run on a command line or with Web portals. (3) Access is granted within a graduated security scheme in which the size of a job that can be initiated depends on the level of authentication of the user.
Development Of A Three-Dimensional Circuit Integration Technology And Computer Architecture
NASA Astrophysics Data System (ADS)
Etchells, R. D.; Grinberg, J.; Nudd, G. R.
1981-12-01
This paper is the first of a series 1,2,3 describing a range of efforts at Hughes Research Laboratories, which are collectively referred to as "Three-Dimensional Microelectronics." The technology being developed is a combination of a unique circuit fabrication/packaging technology and a novel processing architecture. The packaging technology greatly reduces the parasitic impedances associated with signal-routing in complex VLSI structures, while simultaneously allowing circuit densities orders of magnitude higher than the current state-of-the-art. When combined with the 3-D processor architecture, the resulting machine exhibits a one- to two-order of magnitude simultaneous improvement over current state-of-the-art machines in the three areas of processing speed, power consumption, and physical volume. The 3-D architecture is essentially that commonly referred to as a "cellular array", with the ultimate implementation having as many as 512 x 512 processors working in parallel. The three-dimensional nature of the assembled machine arises from the fact that the chips containing the active circuitry of the processor are stacked on top of each other. In this structure, electrical signals are passed vertically through the chips via thermomigrated aluminum feedthroughs. Signals are passed between adjacent chips by micro-interconnects. This discussion presents a broad view of the total effort, as well as a more detailed treatment of the fabrication and packaging technologies themselves. The results of performance simulations of the completed 3-D processor executing a variety of algorithms are also presented. Of particular pertinence to the interests of the focal-plane array community is the simulation of the UNICORNS nonuniformity correction algorithms as executed by the 3-D architecture.
ASIC-based architecture for the real-time computation of 2D convolution with large kernel size
NASA Astrophysics Data System (ADS)
Shao, Rui; Zhong, Sheng; Yan, Luxin
2015-12-01
Bidimensional convolution is a low-level processing algorithm of interest in many areas, but its high computational cost constrains the size of the kernels, especially in real-time embedded systems. This paper presents a hardware architecture for the ASIC-based implementation of 2-D convolution with medium-large kernels. Aiming to improve the efficiency of storage resources on-chip, reducing off-chip bandwidth of these two issues, proposed construction of a data cache reuse. Multi-block SPRAM to cross cached images and the on-chip ping-pong operation takes full advantage of the data convolution calculation reuse, design a new ASIC data scheduling scheme and overall architecture. Experimental results show that the structure can achieve 40× 32 size of template real-time convolution operations, and improve the utilization of on-chip memory bandwidth and on-chip memory resources, the experimental results show that the structure satisfies the conditions to maximize data throughput output , reducing the need for off-chip memory bandwidth.
Chappard, D; Legrand, E; Haettich, B; Chalès, G; Auvinet, B; Eschard, J P; Hamelin, J P; Baslé, M F; Audran, M
2001-11-01
Trabecular bone has been reported as having two-dimensional (2-D) fractal characteristics at the histological level, a finding correlated with biomechanical properties. However, several fractal dimensions (D) are known and computational ways to obtain them vary considerably. This study compared three algorithms on the same series of bone biopsies, to obtain the Kolmogorov, Minkowski-Bouligand, and mass-radius fractal dimensions. The relationships with histomorphometric descriptors of the 2-D trabecular architecture were investigated. Bone biopsies were obtained from 148 osteoporotic male patients. Bone volume (BV/TV), trabecular characteristics (Tb.N, Tb.Sp, Tb.Th), strut analysis, star volumes (marrow spaces and trabeculae), inter-connectivity index, and Euler-Poincaré number were computed. The box-counting method was used to obtain the Kolmogorov dimension (D(k)), the dilatation method for the Minkowski-Bouligand dimension (D(MB)), and the sandbox for the mass-radius dimension (D(MR)) and lacunarity (L). Logarithmic relationships were observed between BV/TV and the fractal dimensions. The best correlation was obtained with D(MR) and the lowest with D(MB). Lacunarity was correlated with descriptors of the marrow cavities (ICI, star volume, Tb.Sp). Linear relationships were observed among the three fractal techniques which appeared highly correlated. A cluster analysis of all histomorphometric parameters provided a tree with three groups of descriptors: for trabeculae (Tb.Th, strut); for marrow cavities (Euler, ICI, Tb.Sp, star volume, L); and for the complexity of the network (Tb.N and the three D's). A sole fractal dimension cannot be used instead of the classic 2-D descriptors of architecture; D rather reflects the complexity of branching trabeculae. Computation time is also an important determinant when choosing one of these methods.
Chappard, D; Legrand, E; Haettich, B; Chalès, G; Auvinet, B; Eschard, J P; Hamelin, J P; Baslé, M F; Audran, M
2001-11-01
Trabecular bone has been reported as having two-dimensional (2-D) fractal characteristics at the histological level, a finding correlated with biomechanical properties. However, several fractal dimensions (D) are known and computational ways to obtain them vary considerably. This study compared three algorithms on the same series of bone biopsies, to obtain the Kolmogorov, Minkowski-Bouligand, and mass-radius fractal dimensions. The relationships with histomorphometric descriptors of the 2-D trabecular architecture were investigated. Bone biopsies were obtained from 148 osteoporotic male patients. Bone volume (BV/TV), trabecular characteristics (Tb.N, Tb.Sp, Tb.Th), strut analysis, star volumes (marrow spaces and trabeculae), inter-connectivity index, and Euler-Poincaré number were computed. The box-counting method was used to obtain the Kolmogorov dimension (D(k)), the dilatation method for the Minkowski-Bouligand dimension (D(MB)), and the sandbox for the mass-radius dimension (D(MR)) and lacunarity (L). Logarithmic relationships were observed between BV/TV and the fractal dimensions. The best correlation was obtained with D(MR) and the lowest with D(MB). Lacunarity was correlated with descriptors of the marrow cavities (ICI, star volume, Tb.Sp). Linear relationships were observed among the three fractal techniques which appeared highly correlated. A cluster analysis of all histomorphometric parameters provided a tree with three groups of descriptors: for trabeculae (Tb.Th, strut); for marrow cavities (Euler, ICI, Tb.Sp, star volume, L); and for the complexity of the network (Tb.N and the three D's). A sole fractal dimension cannot be used instead of the classic 2-D descriptors of architecture; D rather reflects the complexity of branching trabeculae. Computation time is also an important determinant when choosing one of these methods. PMID:11745685
Trapezoidal rule quadrature algorithms for MIMD distributed memory computers
Lyness, J.N.; Plowman, S.E.
1994-08-01
An approach to multi-dimensional quadrature, designed to exploit parallel architectures, is described. This involves transforming the integral in such a way that an accurate result is given by the trapezoidal rule; and by evaluating the resulting sum in a manner which may be efficiently implemented on parallel architectures. This approach is to be implemented in the Liverpool NAG transputer library.
Cladé, Thierry; Snyder, Joshua C.
2010-01-01
Clinical trials which use imaging typically require data management and workflow integration across several parties. We identify opportunities for all parties involved to realize benefits with a modular interoperability model based on service-oriented architecture and grid computing principles. We discuss middleware products for implementation of this model, and propose caGrid as an ideal candidate due to its healthcare focus; free, open source license; and mature developer tools and support. PMID:20449775
An Architecture and Supporting Environment of Service-Oriented Computing Based-On Context Awareness
NASA Astrophysics Data System (ADS)
Ma, Tianxiao; Wu, Gang; Huang, Jun
Service-oriented computing (SOC) is emerging to be an important computing paradigm of the next future. Based on context awareness, this paper proposes an architecture of SOC. A definition of the context in open environments such as Internet is given, which is based on ontology. The paper also proposes a supporting environment for the context-aware SOC, which focus on services on-demand composition and context-awareness evolving. A reference implementation of the supporting environment based on OSGi[11] is given at last.
CCARES: A computer algorithm for the reliability analysis of laminated CMC components
NASA Technical Reports Server (NTRS)
Duffy, Stephen F.; Gyekenyesi, John P.
1993-01-01
Structural components produced from laminated CMC (ceramic matrix composite) materials are being considered for a broad range of aerospace applications that include various structural components for the national aerospace plane, the space shuttle main engine, and advanced gas turbines. Specifically, these applications include segmented engine liners, small missile engine turbine rotors, and exhaust nozzles. Use of these materials allows for improvements in fuel efficiency due to increased engine temperatures and pressures, which in turn generate more power and thrust. Furthermore, this class of materials offers significant potential for raising the thrust-to-weight ratio of gas turbine engines by tailoring directions of high specific reliability. The emerging composite systems, particularly those with silicon nitride or silicon carbide matrix, can compete with metals in many demanding applications. Laminated CMC prototypes have already demonstrated functional capabilities at temperatures approaching 1400 C, which is well beyond the operational limits of most metallic materials. Laminated CMC material systems have several mechanical characteristics which must be carefully considered in the design process. Test bed software programs are needed that incorporate stochastic design concepts that are user friendly, computationally efficient, and have flexible architectures that readily incorporate changes in design philosophy. The CCARES (Composite Ceramics Analysis and Reliability Evaluation of Structures) program is representative of an effort to fill this need. CCARES is a public domain computer algorithm, coupled to a general purpose finite element program, which predicts the fast fracture reliability of a structural component under multiaxial loading conditions.
Architectural Aspects of Grid Computing and its Global Prospects for E-Science Community
NASA Astrophysics Data System (ADS)
Ahmad, Mushtaq
2008-05-01
The paper reviews the imminent Architectural Aspects of Grid Computing for e-Science community for scientific research and business/commercial collaboration beyond physical boundaries. Grid Computing provides all the needed facilities; hardware, software, communication interfaces, high speed internet, safe authentication and secure environment for collaboration of research projects around the globe. It provides highly fast compute engine for those scientific and engineering research projects and business/commercial applications which are heavily compute intensive and/or require humongous amounts of data. It also makes possible the use of very advanced methodologies, simulation models, expert systems and treasure of knowledge available around the globe under the umbrella of knowledge sharing. Thus it makes possible one of the dreams of global village for the benefit of e-Science community across the globe.
New algorithms for the symmetric tridiagonal eigenvalue computation
Pan, V. |
1994-12-31
The author presents new algorithms that accelerate the bisection method for the symmetric eigenvalue problem. The algorithms rely on some new techniques, which include acceleration of Newton`s iteration and can also be further applied to acceleration of some other iterative processes, in particular, of iterative algorithms for approximating polynomial zeros.
ERIC Educational Resources Information Center
Arumi, Francisco N.
Computer programs capable of describing the thermal behavior of buildings are used to help architectural students understand environmental systems. The Numerical Simulation Laboratory at the Architectural School of the University of Texas at Austin was developed to provide the necessary software capable of simulating the energy transactions…
Development of new flux splitting schemes. [computational fluid dynamics algorithms
NASA Technical Reports Server (NTRS)
Liou, Meng-Sing; Steffen, Christopher J., Jr.
1992-01-01
Maximizing both accuracy and efficiency has been the primary objective in designing a numerical algorithm for computational fluid dynamics (CFD). This is especially important for solutions of complex three dimensional systems of Navier-Stokes equations which often include turbulence modeling and chemistry effects. Recently, upwind schemes have been well received for their capability in resolving discontinuities. With this in mind, presented are two new flux splitting techniques for upwind differencing. The first method is based on High-Order Polynomial Expansions (HOPE) of the mass flux vector. The second new flux splitting is based on the Advection Upwind Splitting Method (AUSM). The calculation of the hypersonic conical flow demonstrates the accuracy of the splitting in resolving the flow in the presence of strong gradients. A second series of tests involving the two dimensional inviscid flow over a NACA 0012 airfoil demonstrates the ability of the AUSM to resolve the shock discontinuity at transonic speed. A third case calculates a series of supersonic flows over a circular cylinder. Finally, the fourth case deals with tests of a two dimensional shock wave/boundary layer interaction.
Computational Performance Assessment of k-mer Counting Algorithms.
Pérez, Nelson; Gutierrez, Miguel; Vera, Nelson
2016-04-01
This article is about the assessment of several tools for k-mer counting, with the purpose to create a reference framework for bioinformatics researchers to identify computational requirements, parallelizing, advantages, disadvantages, and bottlenecks of each of the algorithms proposed in the tools. The k-mer counters evaluated in this article were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. Measured parameters were the following: RAM occupied space, processing time, parallelization, and read and write disk access. A dataset consisting of 36,504,800 reads was used corresponding to the 14th human chromosome. The assessment was performed for two k-mer lengths: 31 and 55. Obtained results were the following: pure Bloom filter-based tools and disk-partitioning techniques showed a lesser RAM use. The tools that took less execution time were the ones that used disk-partitioning techniques. The techniques that made the major parallelization were the ones that used disk partitioning, hash tables with lock-free approach, or multiple hash tables.
Computational Performance Assessment of k-mer Counting Algorithms.
Pérez, Nelson; Gutierrez, Miguel; Vera, Nelson
2016-04-01
This article is about the assessment of several tools for k-mer counting, with the purpose to create a reference framework for bioinformatics researchers to identify computational requirements, parallelizing, advantages, disadvantages, and bottlenecks of each of the algorithms proposed in the tools. The k-mer counters evaluated in this article were BFCounter, DSK, Jellyfish, KAnalyze, KHMer, KMC2, MSPKmerCounter, Tallymer, and Turtle. Measured parameters were the following: RAM occupied space, processing time, parallelization, and read and write disk access. A dataset consisting of 36,504,800 reads was used corresponding to the 14th human chromosome. The assessment was performed for two k-mer lengths: 31 and 55. Obtained results were the following: pure Bloom filter-based tools and disk-partitioning techniques showed a lesser RAM use. The tools that took less execution time were the ones that used disk-partitioning techniques. The techniques that made the major parallelization were the ones that used disk partitioning, hash tables with lock-free approach, or multiple hash tables. PMID:26982880
Computational Tools and Algorithms for Designing Customized Synthetic Genes
Gould, Nathan; Hendy, Oliver; Papamichail, Dimitris
2014-01-01
Advances in DNA synthesis have enabled the construction of artificial genes, gene circuits, and genomes of bacterial scale. Freedom in de novo design of synthetic constructs provides significant power in studying the impact of mutations in sequence features, and verifying hypotheses on the functional information that is encoded in nucleic and amino acids. To aid this goal, a large number of software tools of variable sophistication have been implemented, enabling the design of synthetic genes for sequence optimization based on rationally defined properties. The first generation of tools dealt predominantly with singular objectives such as codon usage optimization and unique restriction site incorporation. Recent years have seen the emergence of sequence design tools that aim to evolve sequences toward combinations of objectives. The design of optimal protein-coding sequences adhering to multiple objectives is computationally hard, and most tools rely on heuristics to sample the vast sequence design space. In this review, we study some of the algorithmic issues behind gene optimization and the approaches that different tools have adopted to redesign genes and optimize desired coding features. We utilize test cases to demonstrate the efficiency of each approach, as well as identify their strengths and limitations. PMID:25340050
A constrained conjugate gradient algorithm for computed tomography
Azevedo, S.G.; Goodman, D.M.
1994-11-15
Image reconstruction from projections of x-ray, gamma-ray, protons and other penetrating radiation is a well-known problem in a variety of fields, and is commonly referred to as computed tomography (CT). Various analytical and series expansion methods of reconstruction and been used in the past to provide three-dimensional (3D) views of some interior quantity. The difficulties of these approaches lie in the cases where (a) the number of views attainable is limited, (b) the Poisson (or other) uncertainties are significant, (c) quantifiable knowledge of the object is available, but not implementable, or (d) other limitations of the data exist. We have adapted a novel nonlinear optimization procedure developed at LLNL to address limited-data image reconstruction problems. The technique, known as nonlinear least squares with general constraints or constrained conjugate gradients (CCG), has been successfully applied to a number of signal and image processing problems, and is now of great interest to the image reconstruction community. Previous applications of this algorithm to deconvolution problems and x-ray diffraction images for crystallography have shown the great promise.
Algorithms for computer detection of symmetry elements in molecular systems.
Beruski, Otávio; Vidal, Luciano N
2014-02-01
Simple procedures for the location of proper and improper rotations and reflexion planes are presented. The search is performed with a molecule divided into subsets of symmetrically equivalent atoms (SEA) which are analyzed separately as if they were a single molecule. This approach is advantageous in many aspects. For instance, in those molecules that are symmetric rotors, the number of atoms and the inertia tensor of the SEA provide one straight way to find proper rotations of any order. The algorithms are invariant to the molecular orientation and their computational cost is low, because the main information required to find symmetry elements is interatomic distances and the principal moments of the SEA. For example, our Fortran implementation, running on a single processor, took only a few seconds to locate all 120 symmetry operations of the large and highly symmetrical fullerene C720, belonging to the Ih point group. Finally, we show how the interatomic distances matrix of a slightly unsymmetrical molecule is used to symmetrize its geometry. PMID:24403016
Efficient algorithm to compute mutually connected components in interdependent networks.
Hwang, S; Choi, S; Lee, Deokjae; Kahng, B
2015-02-01
Mutually connected components (MCCs) play an important role as a measure of resilience in the study of interdependent networks. Despite their importance, an efficient algorithm to obtain the statistics of all MCCs during the removal of links has thus far been absent. Here, using a well-known fully dynamic graph algorithm, we propose an efficient algorithm to accomplish this task. We show that the time complexity of this algorithm is approximately O(N(1.2)) for random graphs, which is more efficient than O(N(2)) of the brute-force algorithm. We confirm the correctness of our algorithm by comparing the behavior of the order parameter as links are removed with existing results for three types of double-layer multiplex networks. We anticipate that this algorithm will be used for simulations of large-size systems that have been previously inaccessible. PMID:25768559
Computationally efficient algorithm for high sampling-frequency operation of active noise control
NASA Astrophysics Data System (ADS)
Rout, Nirmal Kumar; Das, Debi Prasad; Panda, Ganapati
2015-05-01
In high sampling-frequency operation of active noise control (ANC) system the length of the secondary path estimate and the ANC filter are very long. This increases the computational complexity of the conventional filtered-x least mean square (FXLMS) algorithm. To reduce the computational complexity of long order ANC system using FXLMS algorithm, frequency domain block ANC algorithms have been proposed in past. These full block frequency domain ANC algorithms are associated with some disadvantages such as large block delay, quantization error due to computation of large size transforms and implementation difficulties in existing low-end DSP hardware. To overcome these shortcomings, the partitioned block ANC algorithm is newly proposed where the long length filters in ANC are divided into a number of equal partitions and suitably assembled to perform the FXLMS algorithm in the frequency domain. The complexity of this proposed frequency domain partitioned block FXLMS (FPBFXLMS) algorithm is quite reduced compared to the conventional FXLMS algorithm. It is further reduced by merging one fast Fourier transform (FFT)-inverse fast Fourier transform (IFFT) combination to derive the reduced structure FPBFXLMS (RFPBFXLMS) algorithm. Computational complexity analysis for different orders of filter and partition size are presented. Systematic computer simulations are carried out for both the proposed partitioned block ANC algorithms to show its accuracy compared to the time domain FXLMS algorithm.
Quantum computation: algorithms and implementation in quantum dot devices
NASA Astrophysics Data System (ADS)
Gamble, John King
In this thesis, we explore several aspects of both the software and hardware of quantum computation. First, we examine the computational power of multi-particle quantum random walks in terms of distinguishing mathematical graphs. We study both interacting and non-interacting multi-particle walks on strongly regular graphs, proving some limitations on distinguishing powers and presenting extensive numerical evidence indicative of interactions providing more distinguishing power. We then study the recently proposed adiabatic quantum algorithm for Google PageRank, and show that it exhibits power-law scaling for realistic WWW-like graphs. Turning to hardware, we next analyze the thermal physics of two nearby 2D electron gas (2DEG), and show that an analogue of the Coulomb drag effect exists for heat transfer. In some distance and temperature, this heat transfer is more significant than phonon dissipation channels. After that, we study the dephasing of two-electron states in a single silicon quantum dot. Specifically, we consider dephasing due to the electron-phonon coupling and charge noise, separately treating orbital and valley excitations. In an ideal system, dephasing due to charge noise is strongly suppressed due to a vanishing dipole moment. However, introduction of disorder or anharmonicity leads to large effective dipole moments, and hence possibly strong dephasing. Building on this work, we next consider more realistic systems, including structural disorder systems. We present experiment and theory, which demonstrate energy levels that vary with quantum dot translation, implying a structurally disordered system. Finally, we turn to the issues of valley mixing and valley-orbit hybridization, which occurs due to atomic-scale disorder at quantum well interfaces. We develop a new theoretical approach to study these effects, which we name the disorder-expansion technique. We demonstrate that this method successfully reproduces atomistic tight-binding techniques
NASA Astrophysics Data System (ADS)
Jolly, Sundeep; Smalley, Daniel; Barabas, James; Bove, V. Michael
2014-02-01
The MIT Mark IV holographic display system employs a novel anisotropic leaky-mode spatial light modulator that allows for the simultaneous and superimposed modulation of red, green, and blue light via wavelength-division multiplexing. This WDM-based scheme for full-color display requires that incoming video signals containing holographic fringe information are comprised of non-overlapping spectral bands that fall within the available 200 MHz output bandwidth of commercial GPUs. These bands correspond to independent color channels in the display output and are appropriately band-limited and centered to match the multiplexed passbands and center frequencies in the frequency response of the mode-coupling device. The computational architecture presented in this paper involves the computation of holographic fringe patterns for each color channel and their summation in generating a single video signal for input to the display. In composite, 18 such input signals, each containing holographic fringe information for 26 horizontal-parallax only holographic lines, are generated via three dual-head GPUs for a total of 468 holographic lines in the display output. We present a general scheme for full-color CGH computation for input to Mark IV and furthermore depict the adaptation of the diffraction specific coherent panoramagram approach to fringe computation for the Mark IV architecture.
Azmy, Yousry
2014-06-10
We employ the Integral Transport Matrix Method (ITMM) as the kernel of new parallel solution methods for the discrete ordinates approximation of the within-group neutron transport equation. The ITMM abandons the repetitive mesh sweeps of the traditional source iterations (SI) scheme in favor of constructing stored operators that account for the direct coupling factors among all the cells' fluxes and between the cells' and boundary surfaces' fluxes. The main goals of this work are to develop the algorithms that construct these operators and employ them in the solution process, determine the most suitable way to parallelize the entire procedure, and evaluate the behavior and parallel performance of the developed methods with increasing number of processes, P. The fastest observed parallel solution method, Parallel Gauss-Seidel (PGS), was used in a weak scaling comparison with the PARTISN transport code, which uses the source iteration (SI) scheme parallelized with the Koch-baker-Alcouffe (KBA) method. Compared to the state-of-the-art SI-KBA with diffusion synthetic acceleration (DSA), this new method- even without acceleration/preconditioning-is completitive for optically thick problems as P is increased to the tens of thousands range. For the most optically thick cells tested, PGS reduced execution time by an approximate factor of three for problems with more than 130 million computational cells on P = 32,768. Moreover, the SI-DSA execution times's trend rises generally more steeply with increasing P than the PGS trend. Furthermore, the PGS method outperforms SI for the periodic heterogeneous layers (PHL) configuration problems. The PGS method outperforms SI and SI-DSA on as few as P = 16 for PHL problems and reduces execution time by a factor of ten or more for all problems considered with more than 2 million computational cells on P = 4.096.
ERIC Educational Resources Information Center
Uwakonye, Obioha; Alagbe, Oluwole; Oluwatayo, Adedapo; Alagbe, Taiye; Alalade, Gbenga
2015-01-01
As a result of globalization of digital technology, intellectual discourse on what constitutes the basic body of architectural knowledge to be imparted to future professionals has been on the increase. This digital revolution has brought to the fore the need to review the already overloaded architectural education curriculum of Nigerian schools of…
NASA Astrophysics Data System (ADS)
Chen, Yufeng; Wu, Zebin; Sun, Le; Wei, Zhihui; Li, Yonglong
2016-04-01
With the gradual increase in the spatial and spectral resolution of hyperspectral images, the size of image data becomes larger and larger, and the complexity of processing algorithms is growing, which poses a big challenge to efficient massive hyperspectral image processing. Cloud computing technologies distribute computing tasks to a large number of computing resources for handling large data sets without the limitation of memory and computing resource of a single machine. This paper proposes a parallel pixel purity index (PPI) algorithm for unmixing massive hyperspectral images based on a MapReduce programming model for the first time in the literature. According to the characteristics of hyperspectral images, we describe the design principle of the algorithm, illustrate the main cloud unmixing processes of PPI, and analyze the time complexity of serial and parallel algorithms. Experimental results demonstrate that the parallel implementation of the PPI algorithm on the cloud can effectively process big hyperspectral data and accelerate the algorithm.
Tools for Analyzing Computing Resource Management Strategies and Algorithms for SDR Clouds
NASA Astrophysics Data System (ADS)
Marojevic, Vuk; Gomez-Miguelez, Ismael; Gelonch, Antoni
2012-09-01
Software defined radio (SDR) clouds centralize the computing resources of base stations. The computing resource pool is shared between radio operators and dynamically loads and unloads digital signal processing chains for providing wireless communications services on demand. Each new user session request particularly requires the allocation of computing resources for executing the corresponding SDR transceivers. The huge amount of computing resources of SDR cloud data centers and the numerous session requests at certain hours of a day require an efficient computing resource management. We propose a hierarchical approach, where the data center is divided in clusters that are managed in a distributed way. This paper presents a set of computing resource management tools for analyzing computing resource management strategies and algorithms for SDR clouds. We use the tools for evaluating a different strategies and algorithms. The results show that more sophisticated algorithms can achieve higher resource occupations and that a tradeoff exists between cluster size and algorithm complexity.
NASA Technical Reports Server (NTRS)
Wright, Mary A.; Regelbrugge, Marc E.; Felippa, Carlos A.
1989-01-01
This is the fourth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 4 describes the nominal-record data management component of the NICE software. It is intended for all users.
Ultra-fast data-mining hardware architecture based on stochastic computing.
Morro, Antoni; Canals, Vincent; Oliver, Antoni; Alomar, Miquel L; Rossello, Josep L
2015-01-01
Minimal hardware implementations able to cope with the processing of large amounts of data in reasonable times are highly desired in our information-driven society. In this work we review the application of stochastic computing to probabilistic-based pattern-recognition analysis of huge database sets. The proposed technique consists in the hardware implementation of a parallel architecture implementing a similarity search of data with respect to different pre-stored categories. We design pulse-based stochastic-logic blocks to obtain an efficient pattern recognition system. The proposed architecture speeds up the screening process of huge databases by a factor of 7 when compared to a conventional digital implementation using the same hardware area.
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the fifth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 5 describes the low-level data management component of the NICE software. It is intended only for advanced programmers involved in maintenance of the software.
Mental Computation or Standard Algorithm? Children's Strategy Choices on Multi-Digit Subtractions
ERIC Educational Resources Information Center
Torbeyns, Joke; Verschaffel, Lieven
2016-01-01
This study analyzed children's use of mental computation strategies and the standard algorithm on multi-digit subtractions. Fifty-eight Flemish 4th graders of varying mathematical achievement level were individually offered subtractions that either stimulated the use of mental computation strategies or the standard algorithm in one choice and two…
VLSI architecture for computer vision based on neurobiological principles of organization
Ginosar, R.; Zeevi, Y.Y.
1988-09-01
Biological and technological (wide-field-of-view) vision systems are confronted with the formidable task of managing data sets at a rate in excess of 10/sup 0/10 bits per second. In both cases, however, considering the required tasks, it appears that the data are highly redundant, and therefore must be reorganized before any type of higher level processing is applied to them. Reorganizations may include compression, and dimensional reduction according to the various relevant parameters. Biological processing at both the retinal and cortical levels often consists of repetitive simple operations applied to spatial and/or temporal neighborhoods, limited in their extend and duration. These are most adequate for the very high image data rates, in spite of the fact that those neurobiological systems actually consists of simple components which are several orders of magnitude slower than electronic components. It is the authors' goal to follow biological algorithms and principles of organization in the design of VLSI architectures, and to achieve similar or better performance in image processing and machine vision. Their efforts have yielded the following families of VLSI devices and systems. A highly parallel Intelligent Scan Image Acquisition VLSI sensing device has been constructed. It selectively scans only the relevant areas of interest in each image, thus effectively providing a compressed image for later processing stages. The device is controlled by an algorithm which is highly sensitive to image content. This sensor imitates the capability of the eye to concentrate on (attend) certain parts of the image, and even extends this by processing multiple focal points simultaneously. This is an example of how we applied the nonuniform sample-and-process algorithm, characteristic of biological vision, in a highly parallel architecture which surpasses the performance of the human eye.
Simple algorithm for computing the geometric measure of entanglement
Streltsov, Alexander; Kampermann, Hermann; Bruss, Dagmar
2011-08-15
We present an easy implementable algorithm for approximating the geometric measure of entanglement from above. The algorithm can be applied to any multipartite mixed state. It involves only the solution of an eigenproblem and finding a singular value decomposition; no further numerical techniques are needed. To provide examples, the algorithm was applied to the isotropic states of three qubits and the three-qubit XX model with external magnetic field.
Apparatuses and Methods for Producing Runtime Architectures of Computer Program Modules
NASA Technical Reports Server (NTRS)
Abi-Antoun, Marwan Elia (Inventor); Aldrich, Jonathan Erik (Inventor)
2013-01-01
Apparatuses and methods for producing run-time architectures of computer program modules. One embodiment includes creating an abstract graph from the computer program module and from containment information corresponding to the computer program module, wherein the abstract graph has nodes including types and objects, and wherein the abstract graph relates an object to a type, and wherein for a specific object the abstract graph relates the specific object to a type containing the specific object; and creating a runtime graph from the abstract graph, wherein the runtime graph is a representation of the true runtime object graph, wherein the runtime graph represents containment information such that, for a specific object, the runtime graph relates the specific object to another object that contains the specific object.
3D-SoftChip: A Novel Architecture for Next-Generation Adaptive Computing Systems
NASA Astrophysics Data System (ADS)
Kim, Chul; Rassau, Alex; Lachowicz, Stefan; Lee, Mike Myung-Ok; Eshraghian, Kamran
2006-12-01
This paper introduces a novel architecture for next-generation adaptive computing systems, which we term 3D-SoftChip. The 3D-SoftChip is a 3-dimensional (3D) vertically integrated adaptive computing system combining state-of-the-art processing and 3D interconnection technology. It comprises the vertical integration of two chips (a configurable array processor and an intelligent configurable switch) through an indium bump interconnection array (IBIA). The configurable array processor (CAP) is an array of heterogeneous processing elements (PEs), while the intelligent configurable switch (ICS) comprises a switch block, 32-bit dedicated RISC processor for control, on-chip program/data memory, data frame buffer, along with a direct memory access (DMA) controller. This paper introduces the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing as a next-generation computing system. The paper further describes the advanced HW/SW codesign and verification methodology, including high-level system modeling of the 3D-SoftChip using SystemC, being used to determine the optimum hardware specification in the early design stage.
A 0.13-µm implementation of 5 Gb/s and 3-mW folded parallel architecture for AES algorithm
NASA Astrophysics Data System (ADS)
Rahimunnisa, K.; Karthigaikumar, P.; Kirubavathy, J.; Jayakumar, J.; Kumar, S. Suresh
2014-02-01
A new architecture for encrypting and decrypting the confidential data using Advanced Encryption Standard algorithm is presented in this article. This structure combines the folded structure with parallel architecture to increase the throughput. The whole architecture achieved high throughput with less power. The proposed architecture is implemented in 0.13-µm Complementary metal-oxide-semiconductor (CMOS) technology. The proposed structure is compared with different existing structures, and from the result it is proved that the proposed structure gives higher throughput and less power compared to existing works.
NASA Astrophysics Data System (ADS)
Xing, Jianchao; Zhang, Jie; Zhao, Yongli; Cao, Xuping; Wang, Dajiang; Gu, Wanyi
2011-12-01
The traditional approach for inter-domain Traffic Engineering Label Switching Path (TE-LSP) computation like BRPC could provide a shortest inter-domain constrained TE-LSP, but under wavelength continuity constraint, it couldn't guarantee the success of the resources reservation for the shortest path. In this paper, a Collision-aware Backward Recursive PCE-based Computation Algorithm (CA-BRPC) in multi-domain optical networks under wavelength continuity constraint is proposed, which is implemented based on Hierarchical PCE (H-PCE) architecture, could provide an optimal inter-domain TE-LSP and avoid resources reservation conflict. Numeric results show that the CA-BRPC could reduce the blocking probability of entire network.
Efficient graph algorithms for sequential and parallel computers. Doctoral thesis
Goldberg, A.V.
1987-02-01
This thesis studies graph algorithms, both in sequential and parallel contexts. In the outline of the thesis, algorithm complexities are stated in terms of the the number of vertices n, the number of edges m, the largest absolute value of capacities U, and the largest absolute value of costs C. Chapter 1 introduces a new approach to the maximum flow problem that leads to better algorithms for the problem. Chapter 2 is devoted to the minimum cost flow problem, which is a generalization of the maximum flow problem. Chapter 3 addresses implementation of parallel algorithms through a case study of an implementation of a parallel maximum flow algorithm. Parallel prefix operations play an important role in the implementation. Present experimental results achieved by the implementation are presented. Present parallel symmetry-breaking techniques are the main topic of Chapter 4.
Disentangling the exchange coupling of entangled donors in the Si quantum computer architecture.
Koiller, Belita; Hu, Xuedong; Drew, H D; Sarma, S Das
2003-02-14
We develop a theory for micro-Raman scattering by single and coupled two-donor states in silicon. We find the Raman spectra to have significant dependence on the donor exchange splitting and the relative spatial positions of the two-donor sites. In particular, we establish a strong correlation between the temperature dependence of the Raman peak intensity and the interdonor exchange coupling. Micro-Raman scattering can therefore potentially become a powerful tool to measure interqubit coupling in the development of a Si quantum computer architecture.
Linnanto, Juha Matti; Freiberg, Arvi
2014-10-06
We have used different computational methods to study structural architecture, and light-harvesting and energy transfer properties of the photosynthetic unit of filamentous anoxygenic phototrophs. Due to the huge number of atoms in the photosynthetic unit, a combination of atomistic and coarse methods was used for electronic structure calculations. The calculations reveal that the light energy absorbed by the peripheral chlorosome antenna complex transfers efficiently via the baseplate and the core B808–866 antenna complexes to the reaction center complex, in general agreement with the present understanding of this complex system.
Selecting Computer Architectures by Means of Control-Flow-Graph Mining
NASA Astrophysics Data System (ADS)
Eichinger, Frank; Böhm, Klemens
Deciding which computer architecture provides the best performance for a certain program is an important problem in hardware design and benchmarking. While previous approaches require expensive simulations or program executions, we propose an approach which solely relies on program analysis. We correlate substructures of the control-flow graphs representing the individual functions with the runtime on certain systems. This leads to a prediction framework based on graph mining, classification and classifier fusion. In our evaluation with the SPEC CPU 2000 and 2006 benchmarks, we predict the faster system out of two with high accuracy and achieve significant speedups in execution time.
NASA Technical Reports Server (NTRS)
Fijany, Amir; Toomarian, Benny N.
2000-01-01
There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years and this trend may also continue for some near future. However, it is a well known fact that there are major obstacles, i.e., physical limitation of feature size reduction and ever increasing cost of foundry, that would prevent the long term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dots based computing by using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general and QCA specifically, are intriguing to NASA due to their high packing density (10(exp 11) - 10(exp 12) per square cm ) and low power consumption (no transfer of current) and potentially higher radiation tolerant. Under Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planner (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA
An Algorithm to Compute Abelian Subalgebras in Linear Algebras of Upper-Triangular Matrices
NASA Astrophysics Data System (ADS)
Ceballos, Manuel; Núñez, Juan; Tenorio, Ángel F.
2009-08-01
This paper deals with the maximal abelian dimension of the Lie algebra hn, of n×n upper-triangular matrices. Regarding this, we obtain an algorithm which computes abelian subalgebras of hn as well as its implementation (and a computational study) by using the symbolic computation package MAPLE, where the order n of the matrices in hn is the unique input needed. Let us note that the algorithm also allows us to obtain a maximal abelian subalgebra of hn.
Amirfattahi, Rassoul
2013-10-01
Owing to its simplicity radix-2 is a popular algorithm to implement fast fourier transform. Radix-2(p) algorithms have the same order of computational complexity as higher radices algorithms, but still retain the simplicity of radix-2. By defining a new concept, twiddle factor template, in this paper, we propose a method for exact calculation of multiplicative complexity for radix-2(p) algorithms. The methodology is described for radix-2, radix-2 (2) and radix-2 (3) algorithms. Results show that radix-2 (2) and radix-2 (3) have significantly less computational complexity compared with radix-2. Another interesting result is that while the number of complex multiplications in radix-2 (3) algorithm is slightly more than radix-2 (2), the number of real multiplications for radix-2 (3) is less than radix-2 (2). This is because of the twiddle factors in the form of which need less number of real multiplications and are more frequent in radix-2 (3) algorithm.
One high-accuracy camera calibration algorithm based on computer vision images
NASA Astrophysics Data System (ADS)
Wang, Ying; Huang, Jianming; Wei, Xiangquan
2015-12-01
Camera calibration is the first step of computer vision and one of the most active research fields nowadays. In order to improve the measurement precision, the internal parameters of the camera should be accurately calibrated. So one high-accuracy camera calibration algorithm is proposed based on the images of planar targets or tridimensional targets. By using the algorithm, the internal parameters of the camera are calibrated based on the existing planar target at the vision-based navigation experiment. The experimental results show that the accuracy of the proposed algorithm is obviously improved compared with the conventional linear algorithm, Tsai general algorithm, and Zhang Zhengyou calibration algorithm. The algorithm proposed by the article can satisfy the need of computer vision and provide reference for precise measurement of the relative position and attitude.
Probabilistic structural analysis algorithm development for computational efficiency
NASA Technical Reports Server (NTRS)
Wu, Y.-T.
1991-01-01
The PSAM (Probabilistic Structural Analysis Methods) program is developing a probabilistic structural risk assessment capability for the SSME components. An advanced probabilistic structural analysis software system, NESSUS (Numerical Evaluation of Stochastic Structures Under Stress), is being developed as part of the PSAM effort to accurately simulate stochastic structures operating under severe random loading conditions. One of the challenges in developing the NESSUS system is the development of the probabilistic algorithms that provide both efficiency and accuracy. The main probability algorithms developed and implemented in the NESSUS system are efficient, but approximate in nature. In the last six years, the algorithms have improved very significantly.
Simulation of Si:P spin-based quantum computer architecture
Chang Yiachung; Fang Angbo
2008-11-07
We present realistic simulation for single and double phosphorous donors in a silicon-based quantum computer design by solving a valley-orbit coupled effective-mass equation for describing phosphorous donors in strained silicon quantum well (QW). Using a generalized unrestricted Hartree-Fock method, we solve the two-electron effective-mass equation with quantum well confinement and realistic gate potentials. The effects of QW width, gate voltages, donor separation, and donor position shift on the lowest singlet and triplet energies and their charge distributions for a neighboring donor pair in the quantum computer(QC) architecture are analyzed. The gate tunability are defined and evaluated for a typical QC design. Estimates are obtained for the duration of spin half-swap gate operation.
Fast algorithm for automatically computing Strahler stream order
Lanfear, Kenneth J.
1990-01-01
An efficient algorithm was developed to determine Strahler stream order for segments of stream networks represented in a Geographic Information System (GIS). The algorithm correctly assigns Strahler stream order in topologically complex situations such as braided streams and multiple drainage outlets. Execution time varies nearly linearly with the number of stream segments in the network. This technique is expected to be particularly useful for studying the topology of dense stream networks derived from digital elevation model data.
Algorithm development for Maxwell's equations for computational electromagnetism
NASA Technical Reports Server (NTRS)
Goorjian, Peter M.
1990-01-01
A new algorithm has been developed for solving Maxwell's equations for the electromagnetic field. It solves the equations in the time domain with central, finite differences. The time advancement is performed implicitly, using an alternating direction implicit procedure. The space discretization is performed with finite volumes, using curvilinear coordinates with electromagnetic components along those directions. Sample calculations are presented of scattering from a metal pin, a square and a circle to demonstrate the capabilities of the new algorithm.
Mixed-radix Algorithm for the Computation of Forward and Inverse MDCT
Wu, Jiasong; Shu, Huazhong; Senhadji, Lotfi; Luo, Limin
2008-01-01
The modified discrete cosine transform (MDCT) and inverse MDCT (IMDCT) are two of the most computational intensive operations in MPEG audio coding standards. A new mixed-radix algorithm for efficient computing the MDCT/IMDCT is presented. The proposed mixed-radix MDCT algorithm is composed of two recursive algorithms. The first algorithm, called the radix-2 decimation in frequency (DIF) algorithm, is obtained by decomposing an N-point MDCT into two MDCTs with the length N/2. The second algorithm, called the radix-3 decimation in time (DIT) algorithm, is obtained by decomposing an N-point MDCT into three MDCTs with the length N/3. Since the proposed MDCT algorithm is also expressed in the form of a simple sparse matrix factorization, the corresponding IMDCT algorithm can be easily derived by simply transposing the matrix factorization. Comparison of the proposed algorithm with some existing ones shows that our proposed algorithm is more suitable for parallel implementation and especially suitable for the layer III of MPEG-1 and MPEG-2 audio encoding and decoding. Moreover, the proposed algorithm can be easily extended to the multidimensional case by using the vector-radix method. PMID:21258639
NASA Astrophysics Data System (ADS)
Zhang, Leihong; Liang, Dong; Li, Bei; Kang, Yi; Pan, Zilan; Zhang, Dawei; Gao, Xiumin; Ma, Xiuhua
2016-07-01
On the basis of analyzing the cosine light field with determined analytic expression and the pseudo-inverse method, the object is illuminated by a presetting light field with a determined discrete Fourier transform measurement matrix, and the object image is reconstructed by the pseudo-inverse method. The analytic expression of the algorithm of computational ghost imaging based on discrete Fourier transform measurement matrix is deduced theoretically, and compared with the algorithm of compressive computational ghost imaging based on random measurement matrix. The reconstruction process and the reconstruction error are analyzed. On this basis, the simulation is done to verify the theoretical analysis. When the sampling measurement number is similar to the number of object pixel, the rank of discrete Fourier transform matrix is the same as the one of the random measurement matrix, the PSNR of the reconstruction image of FGI algorithm and PGI algorithm are similar, the reconstruction error of the traditional CGI algorithm is lower than that of reconstruction image based on FGI algorithm and PGI algorithm. As the decreasing of the number of sampling measurement, the PSNR of reconstruction image based on FGI algorithm decreases slowly, and the PSNR of reconstruction image based on PGI algorithm and CGI algorithm decreases sharply. The reconstruction time of FGI algorithm is lower than that of other algorithms and is not affected by the number of sampling measurement. The FGI algorithm can effectively filter out the random white noise through a low-pass filter and realize the reconstruction denoising which has a higher denoising capability than that of the CGI algorithm. The FGI algorithm can improve the reconstruction accuracy and the reconstruction speed of computational ghost imaging.
Peer-to-peer architectures for exascale computing : LDRD final report.
Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.; Armstrong, Robert C.; Rudish, Donald W.
2010-09-01
The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2
Olson, Wilma K.; Grosner, Michael A.; Czapla, Luke; Swigon, David
2013-01-01
Bacterial gene expression is regulated by DNA elements that often lie far apart along the genomic sequence but come close together during genetic processing. The intervening residues form loops, which are organized by the binding of various proteins. For example, the Escherichia coli Lac repressor protein binds DNA operators, separated by 92 or 401 base pairs, and suppresses the formation of gene products involved in the metabolism of lactose. The system also includes several highly abundant architectural proteins, such as the histone-like (heat unstable) HU protein, which severely deform the double helix upon binding. In order to gain a better understanding of how the naturally stiff DNA double helix forms the short loops detected in vivo, we have developed new computational methods to study the effects of various non-specifically binding proteins on the three-dimensional configurational properties of DNA sequences. This article surveys the approach that we use to generate ensembles of spatially constrained protein-decorated DNA structures (minicircles and Lac repressor-mediated loops) and presents some of the insights gained from the correspondence between computation and experiment about the potential contributions of architectural and regulatory proteins to DNA looping and gene expression. PMID:23514154
Parallel algorithms and archtectures for computational structural mechanics
NASA Technical Reports Server (NTRS)
Patrick, Merrell; Ma, Shing; Mahajan, Umesh
1989-01-01
The determination of the fundamental (lowest) natural vibration frequencies and associated mode shapes is a key step used to uncover and correct potential failures or problem areas in most complex structures. However, the computation time taken by finite element codes to evaluate these natural frequencies is significant, often the most computationally intensive part of structural analysis calculations. There is continuing need to reduce this computation time. This study addresses this need by developing methods for parallel computation.
Topics in Computational Learning Theory and Graph Algorithms.
ERIC Educational Resources Information Center
Board, Raymond Acton
This thesis addresses problems from two areas of theoretical computer science. The first area is that of computational learning theory, which is the study of the phenomenon of concept learning using formal mathematical models. The goal of computational learning theory is to investigate learning in a rigorous manner through the use of techniques…
Bader, D A; Moret, B M; Yan, M
2001-01-01
Hannenhalli and Pevzner gave the first polynomial-time algorithm for computing the inversion distance between two signed permutations, as part of the larger task of determining the shortest sequence of inversions needed to transform one permutation into the other. Their algorithm (restricted to distance calculation) proceeds in two stages: in the first stage, the overlap graph induced by the permutation is decomposed into connected components; then, in the second stage, certain graph structures (hurdles and others) are identified. Berman and Hannenhalli avoided the explicit computation of the overlap graph and gave an O(nalpha(n)) algorithm, based on a Union-Find structure, to find its connected components, where alpha is the inverse Ackerman function. Since for all practical purposes alpha(n) is a constant no larger than four, this algorithm has been the fastest practical algorithm to date. In this paper, we present a new linear-time algorithm for computing the connected components, which is more efficient than that of Berman and Hannenhalli in both theory and practice. Our algorithm uses only a stack and is very easy to implement. We give the results of computational experiments over a large range of permutation pairs produced through simulated evolution; our experiments show a speed-up by a factor of 2 to 5 in the computation of the connected components and by a factor of 1.3 to 2 in the overall distance computation.
NASA Astrophysics Data System (ADS)
Rueda, Antonio J.; Noguera, José M.; Luque, Adrián
2016-02-01
In recent years GPU computing has gained wide acceptance as a simple low-cost solution for speeding up computationally expensive processing in many scientific and engineering applications. However, in most cases accelerating a traditional CPU implementation for a GPU is a non-trivial task that requires a thorough refactorization of the code and specific optimizations that depend on the architecture of the device. OpenACC is a promising technology that aims at reducing the effort required to accelerate C/C++/Fortran code on an attached multicore device. Virtually with this technology the CPU code only has to be augmented with a few compiler directives to identify the areas to be accelerated and the way in which data has to be moved between the CPU and GPU. Its potential benefits are multiple: better code readability, less development time, lower risk of errors and less dependency on the underlying architecture and future evolution of the GPU technology. Our aim with this work is to evaluate the pros and cons of using OpenACC against native GPU implementations in computationally expensive hydrological applications, using the classic D8 algorithm of O'Callaghan and Mark for river network extraction as case-study. We implemented the flow accumulation step of this algorithm in CPU, using OpenACC and two different CUDA versions, comparing the length and complexity of the code and its performance with different datasets. We advance that although OpenACC can not match the performance of a CUDA optimized implementation (×3.5 slower in average), it provides a significant performance improvement against a CPU implementation (×2-6) with by far a simpler code and less implementation effort.
On computational algorithms for real-valued continuous functions of several variables.
Sprecher, David
2014-11-01
The subject of this paper is algorithms for computing superpositions of real-valued continuous functions of several variables based on space-filling curves. The prototypes of these algorithms were based on Kolmogorov's dimension-reducing superpositions (Kolmogorov, 1957). Interest in these grew significantly with the discovery of Hecht-Nielsen that a version of Kolmogorov's formula has an interpretation as a feedforward neural network (Hecht-Nielse, 1987). These superpositions were constructed with devil's staircase-type functions to answer a question in functional complexity, rather than become computational algorithms, and their utility as an efficient computational tool turned out to be limited by the characteristics of space-filling curves that they determined. After discussing the link between the algorithms and these curves, this paper presents two algorithms for the case of two variables: one based on space-filling curves with worked out coding, and the Hilbert curve (Hilbert, 1891).
Vecharynski, Eugene; Yang, Chao; Pask, John E.
2015-06-01
We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix A. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of A. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo
2014-01-01
This report presents an example of the application of multi-criteria decision analysis to the selection of an architecture for a safety-critical distributed computer system. The design problem includes constraints on minimum system availability and integrity, and the decision is based on the optimal balance of power, weight and cost. The analysis process includes the generation of alternative architectures, evaluation of individual decision criteria, and the selection of an alternative based on overall value. In this example presented here, iterative application of the quantitative evaluation process made it possible to deliberately generate an alternative architecture that is superior to all others regardless of the relative importance of cost.
A comparison of computational methods and algorithms for the complex gamma function
NASA Technical Reports Server (NTRS)
Ng, E. W.
1974-01-01
A survey and comparison of some computational methods and algorithms for gamma and log-gamma functions of complex arguments are presented. Methods and algorithms reported include Chebyshev approximations, Pade expansion and Stirling's asymptotic series. The comparison leads to the conclusion that Algorithm 421 published in the Communications of ACM by H. Kuki is the best program either for individual application or for the inclusion in subroutine libraries.
Computational algorithms for discrete wavelet transform using fixed-size filter matrices
NASA Astrophysics Data System (ADS)
Baraniecki, Anna Z.; Karim, Salahadin O.
1992-07-01
This paper describes matrix based algorithms for computing wavelet transform representations with application to multiresolution analysis. Structure of the algorithm presented is well suited for programming purpose and also for the implementation on VLSI processors. By using overlap-add or overlap-save techniques, constant matrix size can be used to accommodate arbitrary data lengths. Performance of the algorithm described in this paper is illustrated by decomposing an image into details and smoothed components.
A New Computer Algorithm for Simultaneous Test Construction of Two-Stage and Multistage Testing.
ERIC Educational Resources Information Center
Wu, Ing-Long
2001-01-01
Presents two binary programming models with a special network structure that can be explored computationally for simultaneous test construction. Uses an efficient special purpose network algorithm to solve these models. An empirical study illustrates the approach. (SLD)
Fast computing global structural balance in signed networks based on memetic algorithm
NASA Astrophysics Data System (ADS)
Sun, Yixiang; Du, Haifeng; Gong, Maoguo; Ma, Lijia; Wang, Shanfeng
2014-12-01
Structural balance is a large area of study in signed networks, and it is intrinsically a global property of the whole network. Computing global structural balance in signed networks, which has attracted some attention in recent years, is to measure how unbalanced a signed network is and it is a nondeterministic polynomial-time hard problem. Many approaches are developed to compute global balance. However, the results obtained by them are partial and unsatisfactory. In this study, the computation of global structural balance is solved as an optimization problem by using the Memetic Algorithm. The optimization algorithm, named Meme-SB, is proposed to optimize an evaluation function, energy function, which is used to compute a distance to exact balance. Our proposed algorithm combines Genetic Algorithm and a greedy strategy as the local search procedure. Experiments on social and biological networks show the excellent effectiveness and efficiency of the proposed method.
Accelerated algorithm for computing the motion of solid particles suspended in fluid.
Ding, E J
2009-08-01
A fast algorithm for computing the motion of solid particles suspended in fluid is presented. The motion of solid particles suspended in Stokes flow can be calculated without fully calculating the fluid motion. When the steady-state simulation is sufficient, this algorithm can greatly accelerate the simulation of solid particle suspension in Stokes flow.
Assessing the Reliability of Computer Adaptive Testing Branching Algorithms Using HyperCAT.
ERIC Educational Resources Information Center
Shermis, Mark D.; And Others
The reliability of four branching algorithms commonly used in computer adaptive testing (CAT) was examined. These algorithms were: (1) maximum likelihood (MLE); (2) Bayesian; (3) modal Bayesian; and (4) crossover. Sixty-eight undergraduate college students were randomly assigned to one of the four conditions using the HyperCard-based CAT program,…
ERIC Educational Resources Information Center
Avancena, Aimee Theresa; Nishihara, Akinori; Vergara, John Paul
2012-01-01
This paper presents the online cognitive and algorithm tests, which were developed in order to determine if certain cognitive factors and fundamental algorithms correlate with the performance of students in their introductory computer science course. The tests were implemented among Management Information Systems majors from the Philippines and…
SequenceL: Automated Parallel Algorithms Derived from CSP-NT Computational Laws
NASA Technical Reports Server (NTRS)
Cooke, Daniel; Rushton, Nelson
2013-01-01
With the introduction of new parallel architectures like the cell and multicore chips from IBM, Intel, AMD, and ARM, as well as the petascale processing available for highend computing, a larger number of programmers will need to write parallel codes. Adding the parallel control structure to the sequence, selection, and iterative control constructs increases the complexity of code development, which often results in increased development costs and decreased reliability. SequenceL is a high-level programming language that is, a programming language that is closer to a human s way of thinking than to a machine s. Historically, high-level languages have resulted in decreased development costs and increased reliability, at the expense of performance. In recent applications at JSC and in industry, SequenceL has demonstrated the usual advantages of high-level programming in terms of low cost and high reliability. SequenceL programs, however, have run at speeds typically comparable with, and in many cases faster than, their counterparts written in C and C++ when run on single-core processors. Moreover, SequenceL is able to generate parallel executables automatically for multicore hardware, gaining parallel speedups without any extra effort from the programmer beyond what is required to write the sequen tial/singlecore code. A SequenceL-to-C++ translator has been developed that automatically renders readable multithreaded C++ from a combination of a SequenceL program and sample data input. The SequenceL language is based on two fundamental computational laws, Consume-Simplify- Produce (CSP) and Normalize-Trans - pose (NT), which enable it to automate the creation of parallel algorithms from high-level code that has no annotations of parallelism whatsoever. In our anecdotal experience, SequenceL development has been in every case less costly than development of the same algorithm in sequential (that is, single-core, single process) C or C++, and an order of magnitude less
Embedded assessment algorithms within home-based cognitive computer game exercises for elders.
Jimison, Holly; Pavel, Misha
2006-01-01
With the recent consumer interest in computer-based activities designed to improve cognitive performance, there is a growing need for scientific assessment algorithms to validate the potential contributions of cognitive exercises. In this paper, we present a novel methodology for incorporating dynamic cognitive assessment algorithms within computer games designed to enhance cognitive performance. We describe how this approach works for variety of computer applications and describe cognitive monitoring results for one of the computer game exercises. The real-time cognitive assessments also provide a control signal for adapting the difficulty of the game exercises and providing tailored help for elders of varying abilities.
Simple and Effective Algorithms: Computer-Adaptive Testing.
ERIC Educational Resources Information Center
Linacre, John Michael
Computer-adaptive testing (CAT) allows improved security, greater scoring accuracy, shorter testing periods, quicker availability of results, and reduced guessing and other undesirable test behavior. Simple approaches can be applied by the classroom teacher, or other content specialist, who possesses simple computer equipment and elementary…
New Effective Multithreaded Matching Algorithms
Manne, Fredrik; Halappanavar, Mahantesh
2014-05-19
Matching is an important combinatorial problem with a number of applications in areas such as community detection, sparse linear algebra, and network alignment. Since computing optimal matchings can be very time consuming, several fast approximation algorithms, both sequential and parallel, have been suggested. Common to the algorithms giving the best solutions is that they tend to be sequential by nature, while algorithms more suitable for parallel computation give solutions of less quality. We present a new simple 1 2 -approximation algorithm for the weighted matching problem. This algorithm is both faster than any other suggested sequential 1 2 -approximation algorithm on almost all inputs and also scales better than previous multithreaded algorithms. We further extend this to a general scalable multithreaded algorithm that computes matchings of weight comparable with the best sequential algorithms. The performance of the suggested algorithms is documented through extensive experiments on different multithreaded architectures.
Toward a scalable quantum computing architecture with mixed species ion chains
NASA Astrophysics Data System (ADS)
Wright, John; Auchter, Carolyn; Chou, Chen-Kuan; Graham, Richard D.; Noel, Thomas W.; Sakrejda, Tomasz; Zhou, Zichao; Blinov, Boris B.
2016-01-01
We report on progress toward implementing mixed ion species quantum information processing for a scalable ion-trap architecture. Mixed species chains may help solve several problems with scaling ion-trap quantum computation to large numbers of qubits. Initial temperature measurements of linear Coulomb crystals containing barium and ytterbium ions indicate that the mass difference does not significantly impede cooling at low ion numbers. Average motional occupation numbers are estimated to be bar{n} ≈ 130 quanta per mode for chains with small numbers of ions, which is within a factor of three of the Doppler limit for barium ions in our trap. We also discuss generation of ion-photon entanglement with barium ions with a fidelity of F ≥ 0.84 , which is an initial step towards remote ion-ion coupling in a more scalable quantum information architecture. Further, we are working to implement these techniques in surface traps in order to exercise greater control over ion chain ordering and positioning.
NASA Astrophysics Data System (ADS)
Ho, Kai Ming
2012-02-01
I will review some of our work in the computation of photonic crystals, focusing on our discovery of the photonic band gap in diamond structures. I will also describe our conception of the cut-and-paste genetic algorithm in materials discovery structure search and discuss applications of the algorithm from early studies of atomic clusters geometries to more recent applications for structures of surfaces, interfaces, nanowires, and bulk crystals.
Rosa, Massimiliano; Warsa, James S; Perks, Michael
2010-12-14
We have implemented a cell-wise, block-Gauss-Seidel (bGS) iterative algorithm, for the solution of the S{sub n} transport equations on the Roadrunner hybrid, parallel computer architecture. A compute node of this massively parallel machine comprises AMD Opteron cores that are linked to a Cell Broadband Engine{trademark} (Cell/B.E.). LAPACK routines have been ported to the Cell/B.E. in order to make use of its parallel Synergistic Processing Elements (SPEs). The bGS algorithm is based on the LU factorization and solution of a linear system that couples the fluxes for all S{sub n} angles and energy groups on a mesh cell. For every cell of a mesh that has been parallel decomposed on the higher-level Opteron processors, a linear system is transferred to the Cell/B.E. and the parallel LAPACK routines are used to compute a solution, which is then transferred back to the Opteron, where the rest of the computations for the S{sub n} transport problem take place. Compared to standard parallel machines, a hundred-fold speedup of the bGS was observed on the hybrid Roadrunner architecture. Numerical experiments with strong and weak parallel scaling demonstrate the bGS method is viable and compares favorably to full parallel sweeps (FPS) on two-dimensional, unstructured meshes when it is applied to optically thick, multi-material problems. As expected, however, it is not as efficient as FPS in optically thin problems.
A development architecture for serious games using BCI (brain computer interface) sensors.
Sung, Yunsick; Cho, Kyungeun; Um, Kyhyun
2012-11-12
Games that use brainwaves via brain-computer interface (BCI) devices, to improve brain functions are known as BCI serious games. Due to the difficulty of developing BCI serious games, various BCI engines and authoring tools are required, and these reduce the development time and cost. However, it is desirable to reduce the amount of technical knowledge of brain functions and BCI devices needed by game developers. Moreover, a systematic BCI serious game development process is required. In this paper, we present a methodology for the development of BCI serious games. We describe an architecture, authoring tools, and development process of the proposed methodology, and apply it to a game development approach for patients with mild cognitive impairment as an example. This application demonstrates that BCI serious games can be developed on the basis of expert-verified theories.
A cognitive architecture for adversary intent inferencing: structure of knowledge and computation
NASA Astrophysics Data System (ADS)
Santos, Eugene, Jr.
2003-09-01
Existing target-based and objectives-based ("strategy-to-task") approaches to mission planning do not explicitly address the adversary"s decision-making processes. Obviously, the adversary"s courses of action (COA) are influenced in a cause-and-effect manner by actions taken by friendly forces. Given the iterative/interleaved nature of actions taken by enemy and friendly forces, mission planning must clearly take adversarial decision making into account especially during concurrent mission planning and execution. Currently, adversarial behaviour with regards to cause-and-effect are difficult to account for within the framework of existing planning approaches. This paper describes a cognitive architecture for computationally modeling, predicting, and explaining adversarial behaviors and COAs and proposes an integrated framework for mission planning. Our framework fits naturally within the Effects-Based Operations (EBO) approach to mission planning.
A Development Architecture for Serious Games Using BCI (Brain Computer Interface) Sensors
Sung, Yunsick; Cho, Kyungeun; Um, Kyhyun
2012-01-01
Games that use brainwaves via brain–computer interface (BCI) devices, to improve brain functions are known as BCI serious games. Due to the difficulty of developing BCI serious games, various BCI engines and authoring tools are required, and these reduce the development time and cost. However, it is desirable to reduce the amount of technical knowledge of brain functions and BCI devices needed by game developers. Moreover, a systematic BCI serious game development process is required. In this paper, we present a methodology for the development of BCI serious games. We describe an architecture, authoring tools, and development process of the proposed methodology, and apply it to a game development approach for patients with mild cognitive impairment as an example. This application demonstrates that BCI serious games can be developed on the basis of expert-verified theories. PMID:23202227
Coupling Multi-Component Models with MPH on Distributed MemoryComputer Architectures
He, Yun; Ding, Chris
2005-03-24
A growing trend in developing large and complex applications on today's Teraflop scale computers is to integrate stand-alone and/or semi-independent program components into a comprehensive simulation package. One example is the Community Climate System Model which consists of atmosphere, ocean, land-surface and sea-ice components. Each component is semi-independent and has been developed at a different institution. We study how this multi-component, multi-executable application can run effectively on distributed memory architectures. For the first time, we clearly identify five effective execution modes and develop the MPH library to support application development utilizing these modes. MPH performs component-name registration, resource allocation and initial component handshaking in a flexible way.
Eichenberger, Alexandre E; Gschwind, Michael K; Gunnels, John A
2013-11-05
Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicating the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.
The direct examination of three-dimensional bone architecture in vitro by computed tomography.
Feldkamp, L A; Goldstein, S A; Parfitt, A M; Jesion, G; Kleerekoper, M
1989-02-01
We describe a new method for the direct examination of three-dimensional bone structure in vitro based on high-resolution computed tomography (CT). Unlike clinical CT, a three-dimensional reconstruction array is created directly, rather than a series of two-dimensional slices. All structural indices commonly determined from two-dimensional histologic sections can be obtained nondestructively from a large number of slices in each of three orthogonal directions. This permits a comprehensive description of structural variation within a specimen and greatly facilitates the study of structural anisotropy. A measure of three-dimensional connectivity (Euler number/tissue volume) has been determined for the first time in human cancellous bone and shown to correlate with several two-dimensional histomorphometric indices. The method has the potential for overcoming many of the limitations of current approaches to the study of bone architecture at the microscopic level.
A computational algorithm for crack determination: The multiple crack case
NASA Technical Reports Server (NTRS)
Bryan, Kurt; Vogelius, Michael
1992-01-01
An algorithm for recovering a collection of linear cracks in a homogeneous electrical conductor from boundary measurements of voltages induced by specified current fluxes is developed. The technique is a variation of Newton's method and is based on taking weighted averages of the boundary data. The method also adaptively changes the applied current flux at each iteration to maintain maximum sensitivity to the estimated locations of the cracks.
Experience with a Genetic Algorithm Implemented on a Multiprocessor Computer
NASA Technical Reports Server (NTRS)
Plassman, Gerald E.; Sobieszczanski-Sobieski, Jaroslaw
2000-01-01
Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation engaging up to 128 processors confirmed expectations of radical elapsed time reductions comparing to a conventional single processor implementation. It also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent.
Computing gap free Pareto front approximations with stochastic search algorithms.
Schütze, Oliver; Laumanns, Marco; Tantar, Emilia; Coello, Carlos A Coello; Talbi, El-Ghazali
2010-01-01
Recently, a convergence proof of stochastic search algorithms toward finite size Pareto set approximations of continuous multi-objective optimization problems has been given. The focus was on obtaining a finite approximation that captures the entire solution set in some suitable sense, which was defined by the concept of epsilon-dominance. Though bounds on the quality of the limit approximation-which are entirely determined by the archiving strategy and the value of epsilon-have been obtained, the strategies do not guarantee to obtain a gap free approximation of the Pareto front. That is, such approximations A can reveal gaps in the sense that points f in the Pareto front can exist such that the distance of f to any image point F(a), a epsilon A, is "large." Since such gap free approximations are desirable in certain applications, and the related archiving strategies can be advantageous when memetic strategies are included in the search process, we are aiming in this work for such methods. We present two novel strategies that accomplish this task in the probabilistic sense and under mild assumptions on the stochastic search algorithm. In addition to the convergence proofs, we give some numerical results to visualize the behavior of the different archiving strategies. Finally, we demonstrate the potential for a possible hybridization of a given stochastic search algorithm with a particular local search strategy-multi-objective continuation methods-by showing that the concept of epsilon-dominance can be integrated into this approach in a suitable way.
NASA Technical Reports Server (NTRS)
Lee, C. S. G.; Chen, C. L.
1989-01-01
Two efficient mapping algorithms for scheduling the robot inverse dynamics computation consisting of m computational modules with precedence relationship to be executed on a multiprocessor system consisting of p identical homogeneous processors with processor and communication costs to achieve minimum computation time are presented. An objective function is defined in terms of the sum of the processor finishing time and the interprocessor communication time. The minimax optimization is performed on the objective function to obtain the best mapping. This mapping problem can be formulated as a combination of the graph partitioning and the scheduling problems; both have been known to be NP-complete. Thus, to speed up the searching for a solution, two heuristic algorithms were proposed to obtain fast but suboptimal mapping solutions. The first algorithm utilizes the level and the communication intensity of the task modules to construct an ordered priority list of ready modules and the module assignment is performed by a weighted bipartite matching algorithm. For a near-optimal mapping solution, the problem can be solved by the heuristic algorithm with simulated annealing. These proposed optimization algorithms can solve various large-scale problems within a reasonable time. Computer simulations were performed to evaluate and verify the performance and the validity of the proposed mapping algorithms. Finally, experiments for computing the inverse dynamics of a six-jointed PUMA-like manipulator based on the Newton-Euler dynamic equations were implemented on an NCUBE/ten hypercube computer to verify the proposed mapping algorithms. Computer simulation and experimental results are compared and discussed.
Howell, J.A.; Whiteson, R.
1992-08-01
Detection of anomalies in nuclear safeguards and computer security systems is a tedious and time-consuming task. It typically requires the examination of large amounts of data for unusual patterns of activity. Neural networks provide a flexible pattern-recognition capability that can easily be adapted for these purposes. In this paper, we discuss architectures for accomplishing this task.
Howell, J.A.; Whiteson, R.
1992-01-01
Detection of anomalies in nuclear safeguards and computer security systems is a tedious and time-consuming task. It typically requires the examination of large amounts of data for unusual patterns of activity. Neural networks provide a flexible pattern-recognition capability that can easily be adapted for these purposes. In this paper, we discuss architectures for accomplishing this task.
Williams, P.T.
1993-09-01
As the field of computational fluid dynamics (CFD) continues to mature, algorithms are required to exploit the most recent advances in approximation theory, numerical mathematics, computing architectures, and hardware. Meeting this requirement is particularly challenging in incompressible fluid mechanics, where primitive-variable CFD formulations that are robust, while also accurate and efficient in three dimensions, remain an elusive goal. This dissertation asserts that one key to accomplishing this goal is recognition of the dual role assumed by the pressure, i.e., a mechanism for instantaneously enforcing conservation of mass and a force in the mechanical balance law for conservation of momentum. Proving this assertion has motivated the development of a new, primitive-variable, incompressible, CFD algorithm called the Continuity Constraint Method (CCM). The theoretical basis for the CCM consists of a finite-element spatial semi-discretization of a Galerkin weak statement, equal-order interpolation for all state-variables, a 0-implicit time-integration scheme, and a quasi-Newton iterative procedure extended by a Taylor Weak Statement (TWS) formulation for dispersion error control. Original contributions to algorithmic theory include: (a) formulation of the unsteady evolution of the divergence error, (b) investigation of the role of non-smoothness in the discretized continuity-constraint function, (c) development of a uniformly H{sup 1} Galerkin weak statement for the Reynolds-averaged Navier-Stokes pressure Poisson equation, (d) derivation of physically and numerically well-posed boundary conditions, and (e) investigation of sparse data structures and iterative methods for solving the matrix algebra statements generated by the algorithm.
Algorithms for computing Minkowski operators and their application in differential games
NASA Astrophysics Data System (ADS)
Dvurechensky, P. E.; Ivanov, G. E.
2014-02-01
The Minkowski operators are considered, which extend the concepts of the Minkowski sum and difference to the case where one of the summands depends on an element of the other term. The properties of these operators are examined. Convolution methods of computer geometry and algorithms for computing the values of the Minkowski operators are developed. These algorithms are used to construct epsilon-optimal control strategies in a nonlinear differential game with a nonconvex target set. The errors of the proposed algorithms are estimated in detail. Numerical results for the conflicting control of a nonlinear pendulum are presented.
[A fast non-local means algorithm for denoising of computed tomography images].
Kang, Changqing; Cao, Wenping; Fang, Lei; Hua, Li; Cheng, Hong
2012-11-01
A fast non-local means image denoising algorithm is presented based on the single motif of existing computed tomography images in medical archiving systems. The algorithm is carried out in two steps of prepossessing and actual possessing. The sample neighborhood database is created via the data structure of locality sensitive hashing in the prepossessing stage. The CT image noise is removed by non-local means algorithm based on the sample neighborhoods accessed fast by locality sensitive hashing. The experimental results showed that the proposed algorithm could greatly reduce the execution time, as compared to NLM, and effectively preserved the image edges and details.
Banks, H Thomas; Hu, Shuhua; Joyner, Michele; Broido, Anna; Canter, Brandi; Gayvert, Kaitlyn; Link, Kathryn
2012-07-01
In this paper, we investigate three particular algorithms: a stochastic simulation algorithm (SSA), and explicit and implicit tau-leaping algorithms. To compare these methods, we used them to analyze two infection models: a Vancomycin-resistant enterococcus (VRE) infection model at the population level, and a Human Immunodeficiency Virus (HIV) within host infection model. While the first has a low species count and few transitions, the second is more complex with a comparable number of species involved. The relative efficiency of each algorithm is determined based on computational time and degree of precision required. The numerical results suggest that all three algorithms have the similar computational efficiency for the simpler VRE model, and the SSA is the best choice due to its simplicity and accuracy. In addition, we have found that with the larger and more complex HIV model, implementation and modification of tau-Leaping methods are preferred.
Noise filtering algorithm for the MFTF-B computer based control system
Minor, E.G.
1983-11-30
An algorithm to reduce the message traffic in the MFTF-B computer based control system is described. The algorithm filters analog inputs to the control system. Its purpose is to distinguish between changes in the inputs due to noise and changes due to significant variations in the quantity being monitored. Noise is rejected while significant changes are reported to the control system data base, thus keeping the data base updated with a minimum number of messages. The algorithm is memory efficient, requiring only four bytes of storage per analog channel, and computationally simple, requiring only subtraction and comparison. Quantitative analysis of the algorithm is presented for the case of additive Gaussian noise. It is shown that the algorithm is stable and tends toward the mean value of the monitored variable over a wide variety of additive noise distributions.
He, Lifeng; Chao, Yuyan
2015-09-01
Labeling connected components and calculating the Euler number in a binary image are two fundamental processes for computer vision and pattern recognition. This paper presents an ingenious method for identifying a hole in a binary image in the first scan of connected-component labeling. Our algorithm can perform connected component labeling and Euler number computing simultaneously, and it can also calculate the connected component (object) number and the hole number efficiently. The additional cost for calculating the hole number is only O(H) , where H is the hole number in the image. Our algorithm can be implemented almost in the same way as a conventional equivalent-label-set-based connected-component labeling algorithm. We prove the correctness of our algorithm and use experimental results for various kinds of images to demonstrate the power of our algorithm.
NASA Astrophysics Data System (ADS)
Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming
2016-04-01
This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features.
Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming
2016-01-01
This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features. PMID:27079888
Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming
2016-04-15
This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features.
Cheng, Jie-Zhi; Ni, Dong; Chou, Yi-Hong; Qin, Jing; Tiu, Chui-Mei; Chang, Yeun-Chung; Huang, Chiun-Sheng; Shen, Dinggang; Chen, Chung-Ming
2016-01-01
This paper performs a comprehensive study on the deep-learning-based computer-aided diagnosis (CADx) for the differential diagnosis of benign and malignant nodules/lesions by avoiding the potential errors caused by inaccurate image processing results (e.g., boundary segmentation), as well as the classification bias resulting from a less robust feature set, as involved in most conventional CADx algorithms. Specifically, the stacked denoising auto-encoder (SDAE) is exploited on the two CADx applications for the differentiation of breast ultrasound lesions and lung CT nodules. The SDAE architecture is well equipped with the automatic feature exploration mechanism and noise tolerance advantage, and hence may be suitable to deal with the intrinsically noisy property of medical image data from various imaging modalities. To show the outperformance of SDAE-based CADx over the conventional scheme, two latest conventional CADx algorithms are implemented for comparison. 10 times of 10-fold cross-validations are conducted to illustrate the efficacy of the SDAE-based CADx algorithm. The experimental results show the significant performance boost by the SDAE-based CADx algorithm over the two conventional methods, suggesting that deep learning techniques can potentially change the design paradigm of the CADx systems without the need of explicit design and selection of problem-oriented features. PMID:27079888
Development and application of unified algorithms for problems in computational science
NASA Technical Reports Server (NTRS)
Shankar, Vijaya; Chakravarthy, Sukumar
1987-01-01
A framework is presented for developing computationally unified numerical algorithms for solving nonlinear equations that arise in modeling various problems in mathematical physics. The concept of computational unification is an attempt to encompass efficient solution procedures for computing various nonlinear phenomena that may occur in a given problem. For example, in Computational Fluid Dynamics (CFD), a unified algorithm will be one that allows for solutions to subsonic (elliptic), transonic (mixed elliptic-hyperbolic), and supersonic (hyperbolic) flows for both steady and unsteady problems. The objectives are: development of superior unified algorithms emphasizing accuracy and efficiency aspects; development of codes based on selected algorithms leading to validation; application of mature codes to realistic problems; and extension/application of CFD-based algorithms to problems in other areas of mathematical physics. The ultimate objective is to achieve integration of multidisciplinary technologies to enhance synergism in the design process through computational simulation. Specific unified algorithms for a hierarchy of gas dynamics equations and their applications to two other areas: electromagnetic scattering, and laser-materials interaction accounting for melting.
ERIC Educational Resources Information Center
Farid, Ayman A.; Zaghloul, Weaam M.; Dewidar, Khaled M.
2014-01-01
The great shift in sustainability and computer aided design in the field of architecture caused a remarkable change in the architecture philosophy, new aspects of beauty and aesthetic values are being introduced, and traditional definitions for beauty cannot fully cover this aspects, which causes a gap between; new architecture works criticism and…
Efficient computer algorithms for infrared astronomy data processing
NASA Technical Reports Server (NTRS)
Pelzmann, R. F., Jr.
1976-01-01
Data processing techniques to be studied for use in infrared astronomy data analysis systems are outlined. Only data from space based telescope systems operating as survey instruments are considered. Resulting algorithms, and in some cases specific software, will be applicable for use with the infrared astronomy satellite (IRAS) and the shuttle infrared telescope facility (SIRTF). Operational tests made during the investigation use data from the celestial mapping program (CMP). The overall task differs from that involved in ground-based infrared telescope data reduction.
Control optimization, stabilization and computer algorithms for aircraft applications
NASA Technical Reports Server (NTRS)
Athans, M. (Editor); Willsky, A. S. (Editor)
1982-01-01
The analysis and design of complex multivariable reliable control systems are considered. High performance and fault tolerant aircraft systems are the objectives. A preliminary feasibility study of the design of a lateral control system for a VTOL aircraft that is to land on a DD963 class destroyer under high sea state conditions is provided. Progress in the following areas is summarized: (1) VTOL control system design studies; (2) robust multivariable control system synthesis; (3) adaptive control systems; (4) failure detection algorithms; and (5) fault tolerant optimal control theory.
Dynamic programming and graph algorithms in computer vision.
Felzenszwalb, Pedro F; Zabih, Ramin
2011-04-01
Optimization is a powerful paradigm for expressing and solving problems in a wide range of areas, and has been successfully applied to many vision problems. Discrete optimization techniques are especially interesting since, by carefully exploiting problem structure, they often provide nontrivial guarantees concerning solution quality. In this paper, we review dynamic programming and graph algorithms, and discuss representative examples of how these discrete optimization techniques have been applied to some classical vision problems. We focus on the low-level vision problem of stereo, the mid-level problem of interactive object segmentation, and the high-level problem of model-based recognition.
Smart sensing to drive real-time loads scheduling algorithm in a domotic architecture
NASA Astrophysics Data System (ADS)
Santamaria, Amilcare Francesco; Raimondo, Pierfrancesco; De Rango, Floriano; Vaccaro, Andrea
2014-05-01
Nowadays the focus on power consumption represent a very important factor regarding the reduction of power consumption with correlated costs and the environmental sustainability problems. Automatic control load based on power consumption and use cycle represents the optimal solution to costs restraint. The purpose of these systems is to modulate the power request of electricity avoiding an unorganized work of the loads, using intelligent techniques to manage them based on real time scheduling algorithms. The goal is to coordinate a set of electrical loads to optimize energy costs and consumptions based on the stipulated contract terms. The proposed algorithm use two new main notions: priority driven loads and smart scheduling loads. The priority driven loads can be turned off (stand by) according to a priority policy established by the user if the consumption exceed a defined threshold, on the contrary smart scheduling loads are scheduled in a particular way to don't stop their Life Cycle (LC) safeguarding the devices functions or allowing the user to freely use the devices without the risk of exceeding the power threshold. The algorithm, using these two kind of notions and taking into account user requirements, manages loads activation and deactivation allowing the completion their operation cycle without exceeding the consumption threshold in an off-peak time range according to the electricity fare. This kind of logic is inspired by industrial lean manufacturing which focus is to minimize any kind of power waste optimizing the available resources.
Application of a Smart Parachute Release Algorithm to the CPAS Test Architecture
NASA Technical Reports Server (NTRS)
Bledsoe, Kristin
2013-01-01
One of the primary test vehicles for the Capsule Parachute Assembly System (CPAS) is the Parachute Test Vehicle (PTV), a capsule shaped structure similar to the Orion design but truncated to fit in the cargo area of a C-17 aircraft. The PTV has a full Orion-like parachute compartment and similar aerodynamics; however, because of the single point attachment of the CPAS parachutes and the lack of Orion-like Reaction Control System (RCS), the PTV has the potential to reach significant body rates. High body rates at the time of the Drogue release may cause the PTV to flip while the parachutes deploy, which may result in the severing of the Pilot or Main risers. In order to prevent high rates at the time of Drogue release, a "smart release" algorithm was implemented in the PTV avionics system. This algorithm, which was developed for the Orion Flight system, triggers the Drogue parachute release when the body rates are near a minimum. This paper discusses the development and testing of the smart release algorithm; its implementation in the PTV avionics and the pretest simulation; and the results of its use on two CPAS tests.
Algorithmic mechanisms for reliable crowdsourcing computation under collusion.
Fernández Anta, Antonio; Georgiou, Chryssis; Mosteiro, Miguel A; Pareja, Daniel
2015-01-01
We consider a computing system where a master processor assigns a task for execution to worker processors that may collude. We model the workers' decision of whether to comply (compute the task) or not (return a bogus result to save the computation cost) as a game among workers. That is, we assume that workers are rational in a game-theoretic sense. We identify analytically the parameter conditions for a unique Nash Equilibrium where the master obtains the correct result. We also evaluate experimentally mixed equilibria aiming to attain better reliability-profit trade-offs. For a wide range of parameter values that may be used in practice, our simulations show that, in fact, both master and workers are better off using a pure equilibrium where no worker cheats, even under collusion, and even for colluding behaviors that involve deviating from the game. PMID:25793524
A Computationally Efficient Algorithm for Aerosol Phase Equilibrium
Zaveri, Rahul A.; Easter, Richard C.; Peters, Len K.; Wexler, Anthony S.
2004-10-04
Three-dimensional models of atmospheric inorganic aerosols need an accurate yet computationally efficient thermodynamic module that is repeatedly used to compute internal aerosol phase state equilibrium. In this paper, we describe the development and evaluation of a computationally efficient numerical solver called MESA (Multicomponent Equilibrium Solver for Aerosols). The unique formulation of MESA allows iteration of all the equilibrium equations simultaneously while maintaining overall mass conservation and electroneutrality in both the solid and liquid phases. MESA is unconditionally stable, shows robust convergence, and typically requires only 10 to 20 single-level iterations (where all activity coefficients and aerosol water content are updated) per internal aerosol phase equilibrium calculation. Accuracy of MESA is comparable to that of the highly accurate Aerosol Inorganics Model (AIM), which uses a rigorous Gibbs free energy minimization approach. Performance evaluation will be presented for a number of complex multicomponent mixtures commonly found in urban and marine tropospheric aerosols.
Algorithmic Mechanisms for Reliable Crowdsourcing Computation under Collusion
Fernández Anta, Antonio; Georgiou, Chryssis; Mosteiro, Miguel A.; Pareja, Daniel
2015-01-01
We consider a computing system where a master processor assigns a task for execution to worker processors that may collude. We model the workers’ decision of whether to comply (compute the task) or not (return a bogus result to save the computation cost) as a game among workers. That is, we assume that workers are rational in a game-theoretic sense. We identify analytically the parameter conditions for a unique Nash Equilibrium where the master obtains the correct result. We also evaluate experimentally mixed equilibria aiming to attain better reliability-profit trade-offs. For a wide range of parameter values that may be used in practice, our simulations show that, in fact, both master and workers are better off using a pure equilibrium where no worker cheats, even under collusion, and even for colluding behaviors that involve deviating from the game. PMID:25793524
Cognitive Correlates of Performance in Algorithms in a Computer Science Course for High School
ERIC Educational Resources Information Center
Avancena, Aimee Theresa; Nishihara, Akinori
2014-01-01
Computer science for high school faces many challenging issues. One of these is whether the students possess the appropriate cognitive ability for learning the fundamentals of computer science. Online tests were created based on known cognitive factors and fundamental algorithms and were implemented among the second grade students in the…
A subspace preconditioning algorithm for eigenvector/eigenvalue computation
Bramble, J.H.; Knyazev, A.V.; Pasciak, J.E.
1996-12-31
We consider the problem of computing a modest number of the smallest eigenvalues along with orthogonal bases for the corresponding eigen-spaces of a symmetric positive definite matrix. In our applications, the dimension of a matrix is large and the cost of its inverting is prohibitive. In this paper, we shall develop an effective parallelizable technique for computing these eigenvalues and eigenvectors utilizing subspace iteration and preconditioning. Estimates will be provided which show that the preconditioned method converges linearly and uniformly in the matrix dimension when used with a uniform preconditioner under the assumption that the approximating subspace is close enough to the span of desired eigenvectors.
Signal window minimum average error algorithm for multi-phase level computer-generated holograms
NASA Astrophysics Data System (ADS)
El Bouz, Marwa; Heggarty, Kevin
2000-06-01
This paper extends the article "Signal window minimum average error algorithm for computer-generated holograms" (JOSA A 1998) to multi-phase level CGHs. We show that using the same rule for calculating the complex error diffusion weights, iterative-algorithm-like low-error signal windows can be obtained for any window shape or position (on- or off-axis) and any number of CGH phase levels. Important algorithm parameters such as amplitude normalisation level and phase freedom diffusers are described and investigated to optimize the algorithm. We show that, combined with a suitable diffuser, the algorithm makes feasible the calculation of high performance CGHs far larger than currently practical with iterative algorithms yet now realisable with modern fabrication techniques. Preliminary experimental optical reconstructions are presented.
Advanced Architectures for Astrophysical Supercomputing
NASA Astrophysics Data System (ADS)
Barsdell, B. R.; Barnes, D. G.; Fluke, C. J.
2010-12-01
Astronomers have come to rely on the increasing performance of computers to reduce, analyze, simulate and visualize their data. In this environment, faster computation can mean more science outcomes or the opening up of new parameter spaces for investigation. If we are to avoid major issues when implementing codes on advanced architectures, it is important that we have a solid understanding of our algorithms. A recent addition to the high-performance computing scene that highlights this point is the graphics processing unit (GPU). The hardware originally designed for speeding-up graphics rendering in video games is now achieving speed-ups of O(100×) in general-purpose computation - performance that cannot be ignored. We are using a generalized approach, based on the analysis of astronomy algorithms, to identify the optimal problem-types and techniques for taking advantage of both current GPU hardware and future developments in computing architectures.
Fast parallel algorithms that compute transitive closure of a fuzzy relation
NASA Technical Reports Server (NTRS)
Kreinovich, Vladik YA.
1993-01-01
The notion of a transitive closure of a fuzzy relation is very useful for clustering in pattern recognition, for fuzzy databases, etc. The original algorithm proposed by L. Zadeh (1971) requires the computation time O(n(sup 4)), where n is the number of elements in the relation. In 1974, J. C. Dunn proposed a O(n(sup 2)) algorithm. Since we must compute n(n-1)/2 different values s(a, b) (a not equal to b) that represent the fuzzy relation, and we need at least one computational step to compute each of these values, we cannot compute all of them in less than O(n(sup 2)) steps. So, Dunn's algorithm is in this sense optimal. For small n, it is ok. However, for big n (e.g., for big databases), it is still a lot, so it would be desirable to decrease the computation time (this problem was formulated by J. Bezdek). Since this decrease cannot be done on a sequential computer, the only way to do it is to use a computer with several processors working in parallel. We show that on a parallel computer, transitive closure can be computed in time O((log(sub 2)(n))2).
A computational algorithm to predict shRNA potency
Marran, Krista; Zhou, Xin; Gordon, Assaf; Demerdash, Osama El; Wagenblast, Elvin; Kim, Sun; Fellmann, Christof; Hannon, Gregory J.
2014-01-01
The strength of conclusions drawn from RNAi-based studies is heavily influenced by the quality of tools used to elicit knockdown. Prior studies have developed algorithms to design siRNAs. However, to date, no established method has emerged to identify effective shRNAs, which have lower intracellular abundance than transfected siRNAs and undergo additional processing steps. We recently developed a multiplexed assay for identifying potent shRNAs and have used this method to generate ~250,000 shRNA efficacy data-points. Using these data, we developed shERWOOD, an algorithm capable of predicting, for any shRNA, the likelihood that it will elicit potent target knockdown. Combined with additional shRNA design strategies, shERWOOD allows the ab initio identification of potent shRNAs that target, specifically, the majority of each gene’s multiple transcripts. We have validated the performance of our shRNA designs using several orthogonal strategies and have constructed genome-wide collections of shRNAs for humans and mice based upon our approach. PMID:25435137
A low computational complexity algorithm for ECG signal compression.
Blanco-Velasco, Manuel; Cruz-Roldán, Fernando; López-Ferreras, Francisco; Bravo-Santos, Angel; Martínez-Muñoz, Damián
2004-09-01
In this work, a filter bank-based algorithm for electrocardiogram (ECG) signals compression is proposed. The new coder consists of three different stages. In the first one--the subband decomposition stage--we compare the performance of a nearly perfect reconstruction (N-PR) cosine-modulated filter bank with the wavelet packet (WP) technique. Both schemes use the same coding algorithm, thus permitting an effective comparison. The target of the comparison is the quality of the reconstructed signal, which must remain within predetermined accuracy limits. We employ the most widely used quality criterion for the compressed ECG: the percentage root-mean-square difference (PRD). It is complemented by means of the maximum amplitude error (MAX). The tests have been done for the 12 principal cardiac leads, and the amount of compression is evaluated by means of the mean number of bits per sample (MBPS) and the compression ratio (CR). The implementation cost for both the filter bank and the WP technique has also been studied. The results show that the N-PR cosine-modulated filter bank method outperforms the WP technique in both quality and efficiency. PMID:15271283
Chandrasekhar equations and computational algorithms for distributed parameter systems
NASA Technical Reports Server (NTRS)
Burns, J. A.; Ito, K.; Powers, R. K.
1984-01-01
The Chandrasekhar equations arising in optimal control problems for linear distributed parameter systems are considered. The equations are derived via approximation theory. This approach is used to obtain existence, uniqueness, and strong differentiability of the solutions and provides the basis for a convergent computation scheme for approximating feedback gain operators. A numerical example is presented to illustrate these ideas.
Robust Algorithm for Computing Statistical Stark Broadening of Spectral Lines
Iglesias, C A; Sonnad, V
2010-02-10
A method previously developed to solve large-scale linear systems is applied to statistical Stark broadened line shape calculations. The method is formally exact, numerically stable, and allows optimization of the integration over the quasi-static field to assure numerical accuracy. Furthermore, the method does not increase the computational effort and often can decrease it compared to the conventional approach.